Stop talking about “statistical significance and practical significance”

You’ve heard it a million times already: “statistical significance” is not the same as “practical significance.” The idea you’re supposed to have in mind, I think, is some effect size estimated at 0.003 with a standard error of 0.001. It’s statistically significant but not practically significant! I’m assuming here that these numbers have some interpretable scale, for example normalized based on total sales or average test score.

But what if that 0.003 is practically significant? For example, a decrease of 3/10 of 1 percent in heart attack deaths could still save 2000 lives. Then what do I say? At this point my concern is that an observed effect in some particular setting might not generalize so well to other scenarios, other people, other times. An effect of +0.003 for these people in this setting might be -0.002 for another group next year, etc.

My point is that, for the purpose of inference, what’s relevant here is not so much the practical significance of the effect size but rather its variation. Effects are not constant, and at some point an effect will be smaller than the variation you might expect to see.
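To make the arithmetic concrete, here is a minimal sketch in Python. The estimate and standard error are the ones from the example above; the between-setting standard deviation `tau` is purely a hypothetical number chosen for illustration, not something estimated from any data, and the normal approximation is an assumption too.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

estimate = 0.003   # effect size estimate from the example
se = 0.001         # its standard error

# Classical summary: how many standard errors from zero?
z = estimate / se
p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z)))

# Hypothetical: suppose the true effect varies across settings with sd tau.
tau = 0.003
# Uncertainty about the effect in a new setting combines both sources.
sd_new_setting = math.sqrt(se**2 + tau**2)
# Probability that the effect in a new setting has the opposite sign,
# treating the effect there as normal(estimate, sd_new_setting).
prob_opposite_sign = normal_cdf(-estimate / sd_new_setting)

print(f"z = {z:.1f}, two-sided p = {p_two_sided:.4f}")
print(f"sd for a new setting = {sd_new_setting:.4f}")
print(f"Pr(opposite sign in a new setting) = {prob_opposite_sign:.2f}")
```

With these made-up numbers the estimate is about three standard errors from zero, yet once the assumed variation across settings is included there is a non-negligible chance that the effect somewhere else has the opposite sign. The specific probability depends entirely on the hypothetical `tau`.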

The other reason I don’t like the “statistically significant but not practically significant” thing is that it can be taken to imply that larger effect size estimates are better. But that’s not always true. It depends on the uncertainty. Recall the backpack fallacy and that ridiculous claim that beautiful parents are 36% more likely to have girls. That claim would indeed have practical significance if it were true—but there’s no evidence for it (sorry, Freakonomics guys!).

P.S. This general topic has come up before, for example:

Problems with the jargon “statistically significant” and “clinically significant”

What’s misleading about the phrase, “Statistical significance is not the same as practical significance”

Statistical significance, practical significance, and interactions

P.P.S. More here from Megan Higgs.

“Unrepresentative big surveys significantly overestimated US vaccine uptake,” and the problem of generalizing from sample to population

First the story, then the background, then my final thoughts.

1. The story

Valerie Bradley writes:

Our (me, Shiro Kuriwaki, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, Seth Flaxman) recent article in Nature shows that two large online surveys of US adults consistently overestimated first-dose COVID-19 vaccine uptake relative to counts released by the CDC throughout the spring of 2021. By May 2021, Delphi-Facebook’s COVID-19 Trends and Impact Survey (collecting about 250,000 responses per week) and the Census Bureau’s Household Pulse Survey (about 75,000 responses every 2 weeks) overestimated first dose uptake by 17 percentage points (14–20 with 5% benchmark imprecision) and 14 pp (11–17 with 5% benchmark imprecision), respectively, relative to the CDC’s historically-adjusted estimates released May 26, 2021. These errors are orders of magnitude larger than the uncertainty intervals reported by each survey, which are minuscule due to the large sample sizes. Meanwhile, a more traditional survey run by Axios-Ipsos, with only about 1000 responses per week, provides more reliable estimates with reasonable uncertainty. We use Xiao-Li’s framework to decompose the error in each survey and show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters more than data quantity, and that compensating the former with the latter is a mathematically provable losing proposition.

We’ve gotten some great feedback since publication, which we thought would be useful to discuss more broadly:

1. Is this actually a paradox?

This question was raised by Andrew, who wrote:

“The only thing I don’t really understand about this particular article is why they talk about a ‘Big Data Paradox.’ I agree with their message ‘that data quality matters more than data quantity,’ but I don’t see this as a paradox. Maybe the paradox is in the way that surveys are reported? They say that increasing data size ‘magnifies the effect of survey bias,’ but that seems to miss the point: increasing data size does not, in and of itself, increase bias. For example, if the two surveys they are criticizing (Delphi and Household Pulse) were smaller, they’d still be just as biased, right? So I don’t get it when they say ‘small biases are compounded as sample size increases.’ I’d think it would be more accurate to say that biases persist even as sample size increases, not that they are compounded.”

Shiro wrote back:

“‘biases persist even as sample size increases’ is exactly right. We’d say it’s also compounded in the sense that it becomes a larger portion of the total error, technically the Mean Square Error (MSE, Bias^2 + Variance). MSE is also the metric by which we quantify ‘accurate’ in the abstract when we write ‘[unrepresentative large surveys] are no more accurate than … [a SRS of size 10] for estimating a population mean.’”

Xiao-Li, who wrote the original 2018 paper on the Big Data Paradox, added,

“I used the term “paradox” in the same way as in “Simpson’s paradox”, which, as you know, is not a mathematical paradox at all, but rather people’s misperceptions from comparing apples with oranges created by confounding factors. The “Big Data Paradox” refers to the phenomenon that the larger the size, the smaller the error bar according to the traditional calculations, but when the answers are biased, such diminishing error bars would increasingly mislead us. In this sense, it appears to many as paradoxical, since the more data, the more we get misled. But as you know, this is all due to the incorrect calculation of the error bars, which should be the total error, not just the sampling error.”

As one intuition for this paradox it’s easy to convince oneself of the following. A 99% sample of the population will produce estimates with almost no error, even if the sampling was somewhat biased; therefore if one can collect Big Data that approaches that 99% sample of the population, estimates based on that sample should become increasingly correct. However, despite this intuition, overcoming even small bias in data collection is incredibly difficult with Big-but-not-99% data. Meng (2018) describes this using the Law of Large Populations (as opposed to the Law of Large Numbers): the theoretical result that the error of the estimate increases asymptotically with population size N, all else fixed.
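The MSE accounting in this exchange can be sketched in a few lines of Python. This is a simplified illustration, not the decomposition used in the article: it treats the bias as a fixed number, uses the binomial variance p(1 - p)/n for the sampling variance, and asks what size of unbiased simple random sample would give the same MSE. The true uptake `p = 0.6` is an assumed round number; the bias of 0.17 and the sample sizes of roughly 1,000, 75,000, and 250,000 are taken from the description above, with the same bias applied at every size purely to isolate the effect of n.

```python
def mse(bias, p, n):
    """Mean squared error of a sample proportion: bias^2 + variance."""
    return bias**2 + p * (1.0 - p) / n

def bias_share(bias, p, n):
    """Fraction of the MSE that is due to squared bias."""
    return bias**2 / mse(bias, p, n)

def equivalent_srs_size(bias, p, n):
    """Size of an unbiased simple random sample with the same MSE."""
    return p * (1.0 - p) / mse(bias, p, n)

p = 0.6        # assumed true proportion (illustrative only)
bias = 0.17    # overestimate quoted above for Delphi-Facebook, May 2021

for n in (1_000, 75_000, 250_000):
    share = bias_share(bias, p, n)
    n_eff = equivalent_srs_size(bias, p, n)
    print(f"n = {n:>7}: bias share of MSE = {share:.4f}, "
          f"equivalent SRS size = {n_eff:.1f}")
```

In this sketch the bias is held fixed, so it persists as n grows while the variance term shrinks. That is the sense in which the bias becomes a larger portion of the total error, and it is why the equivalent simple-random-sample size stays below 10 no matter how large n gets.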

2. The CDC data isn’t perfect – how does this affect the results?

Great point. The potential for non-sampling error in the CDC benchmark is why we have included +/-5% and +/-10% “Benchmark Imprecision Intervals” in our analysis. The 5% and 10% intervals stem from our analysis (ED Figure 3 in the article) of changes in the CDC’s estimates of first dose uptake as reports are updated to account for reporting delays. These delays have already been accounted for in the CDC data we use, so the benchmark imprecision intervals allow for additional non-sampling error that is yet undiscovered. There have been reports that the CDC may overcount first-dose vaccine uptake, which would affect our analysis, but it would make Census Household Pulse and Delphi-Facebook error larger than what we have estimated.

3. Delphi-Facebook wasn’t designed to estimate vaccine uptake at a particular point in time (we have the CDC for that!), but rather to measure changes over time in other COVID-related outcomes, so how much does this analysis affect those primary outcomes?

It is certainly true that Census Household Pulse and Delphi-Facebook were not designed to measure snapshots of COVID vaccine uptake, but rather to collect data to perform fine-grained spatiotemporal analysis of COVID-like symptoms and other outcomes.

However, we address the implications of our analysis on these other outcomes in the following portion of the “Addressing Common Misperceptions” section of our article:
“One might hope that surveys biased on vaccine uptake are not biased on other outcomes, for which there may not be benchmarks to reveal their biases. However, the absence of evidence of bias for the remaining outcomes is not evidence of its absence. In fact, mathematically, when a survey is found to be biased with respect to one variable, it implies that the entire survey fails to be statistically representative.”

Furthermore, unfortunately, neither Delphi-Facebook nor Census Household Pulse accurately tracks changes in COVID vaccination status over time. The plot below extends our analysis to December and shows how first-dose uptake estimated by the two surveys plateaus in late July 2021, while CDC estimates keep increasing. According to Delphi-Facebook and CHP, no US adults have gotten first doses since late summer.

One may look at the plot above and infer that Delphi-Facebook and Census Household Pulse estimates are improving, since they overlap with CDC counts in late November 2021. However, as Xiao-Li writes, this “is an illusion since any two lines will cross somewhere as long as they are not in parallel.” In other words, we know from the demographic distributions of respondents and the rate of change of vaccination status during this period that the respondent populations of Census Household Pulse and Delphi-Facebook are different from that of US adults. It may be true that by late November 2021, the population of respondents happened to have the same vaccination rate as that of US adults, but that coincidence does not prove the reliability of the surveys.

This does not mean that CHP and Delphi-Facebook are not still useful sources of information about COVID. However, we will reiterate our words of caution that given that we know the surveys are not statistically representative, researchers using the data should make an effort to justify why it is still useful for their analysis, or how they have accounted for the bias. A great example of this can be found in Lessler et al. (2021) which cites a technical report from Lupton-Smith et al. (2021) validating the Delphi-Facebook survey for their particular use case (in-person schooling).

Speaking generally, one of the core problems of statistics is generalizing from available data to the population of interest. I still don’t really get the whole “big data paradox” thing, but I agree that there’s no reason to think that a sample, just because it’s large, will be representative of the population of interest. Indeed, there can be more than one population of interest, and a sample can’t really be representative of more than one population! As statisticians, we should probably be spending less time thinking about drawing balls from urns and more time thinking about how our samples differ from the populations we’re interested in.

2. The background

Last week a journalist sent me an email:

I am writing about this new paper in Nature [by Valerie Bradley et al.] that criticizes the methodology of Covid surveys by Census and Delphi/Facebook.

One of the authors, Seth Flaxman, pointed me to something you wrote about the Household Pulse Survey.

It turned out that I’ve collaborated with several of the authors of this new paper, and I responded as follows:

The article looks reasonable to me. I know three of the authors and am ccing them in case they have additional thoughts.

The basic claim, that the traditional survey with adjustments obtained a more representative sample than the two other surveys, is plausible, and I agree with their focus on total survey error as being more important than narrow measures of sampling error. My colleagues and I have found the same thing in political polls. Total survey error has been discussed for a long time in the survey sampling world, and it’s certainly an idea that’s well understood by the Census. However, as Bradley et al. discussed, in practice people often forget about nonsampling error (and that came up in that blog discussion you linked to also).

The only thing I don’t really understand about this particular article is why they talk about a “Big Data Paradox.” I agree with their message “that data quality matters more than data quantity,” but I don’t see this as a paradox. Maybe the paradox is in the way that surveys are reported? They say that increasing data size “magnifies the effect of survey bias,” but that seems to miss the point: increasing data size does not, in and of itself, increase bias. For example, if the two surveys they are criticizing (Delphi and Household Pulse) were smaller, they’d still be just as biased, right? So I don’t get it when they say “small biases are compounded as sample size increases.” I’d think it would be more accurate to say that biases persist even as sample size increases, not that they are compounded.

But I guess this doesn’t affect their main point, which is that those two particular surveys had problems.

The authors of the article had more to say, and this blog seemed like a good place for the discussion. There’s more space on a blog than in the newspaper.

3. My final thoughts

Getting back to the big data paradox, I guess I’d say two things:

1. Big data are typically messy data, available data not random samples. So I agree that, in practice, bias can be a bigger problem with big data than with traditional surveys. Then again, small data can be really biased too, and some prominent researchers don’t see the problem at all. So maybe safest to just say that we should always be concerned with discrepancies between sample and population, whatever the sample size.

2. Remember the most important formula in statistics. Larger sample size corresponds to smaller sampling variance, so that bias becomes more important. In the surveys discussed by Bradley et al., though, all the sample sizes were large, and really all that mattered in any of them was nonsampling error. Which is consistent with their message not to simply use sample size as a measure of survey quality.

P.S. Keith puts it well in comments:

With a moderate amount of bias confidence interval coverage will decrease with increasing sample size.
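Keith’s point can be checked with a short calculation. The sketch below assumes a normally distributed estimate centered at the truth plus a fixed bias, with standard error sigma/sqrt(n), and a nominal 95% interval of plus or minus 1.96 standard errors that ignores the bias. The values `bias = 0.02` and `sigma = 0.5` are hypothetical, chosen only to show the pattern.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage(bias, sigma, n, z_crit=1.96):
    """Actual coverage of the nominal interval estimate +/- z_crit * se,
    where se = sigma / sqrt(n) and estimate ~ normal(truth + bias, se)."""
    shift = bias * math.sqrt(n) / sigma   # bias measured in standard errors
    return normal_cdf(z_crit - shift) - normal_cdf(-z_crit - shift)

bias = 0.02    # hypothetical fixed bias
sigma = 0.5    # hypothetical sd of a single observation

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: coverage = {coverage(bias, sigma, n):.3f}")
```

The interval narrows as n grows but stays centered in the wrong place, so the fraction of intervals containing the truth falls from close to the nominal 95% toward zero.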

Patterns on the complex floor: A fun little example of simulation-based experimentation in mathematics

Calling it a “fun little example” might sound disrespectful, but it’s not. Examples are not hard to come by, but “fun” and “little” are special. Just as it’s said that it can take a lot of work to write something concise, it can take a lot of understanding to demonstrate an important nontrivial point with an example that is fun and little.

As regular readers know, we’ve been pushing simulation-based experimentation (formerly called fake-data simulation) for a long time. I think simulation-based experimentation is usually the best way, by far, of understanding statistical procedures.

One thing I haven’t thought about so much is how the same idea can work in math, even for problems that have no inherent probabilistic structure.

The general idea goes as follows. Suppose you have a mathematical conjecture. It might be true or it might be false, you don’t know, that’s why it’s a conjecture, not a theorem. You can try to prove it for some special cases and you can also look for counterexamples in various places. This discussion of cases and places suggests a meta-model in which there is some space of possible values. Consider a space Theta and a conjecture C(theta). The conjecture is a true theorem if C(theta) is true for all theta in Theta.

At this point you can do simulated-data experimentation by sampling theta from some distribution and evaluating C at each sampled value. If C is true for enough values, you can think about making probabilistic statements about your theorem, but even if this is not the case you can learn from the pattern of truth.
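Here is a minimal sketch of that recipe in Python. The conjecture used here, floor(x + y) = floor(x) + floor(y) for real x and y, is a stand-in I chose because it is easy to check; it is not the identity from Cook’s post. The sampling distribution (uniform on a square) and the function names are likewise just assumptions for illustration.

```python
import math
import random

def conjecture(x, y):
    """Stand-in conjecture C(theta) with theta = (x, y):
    floor(x + y) == floor(x) + floor(y)."""
    return math.floor(x + y) == math.floor(x) + math.floor(y)

def run_experiment(n_samples, seed=0):
    """Sample theta uniformly from the square [0, 10) x [0, 10) and
    record where the conjecture holds and where it fails."""
    rng = random.Random(seed)
    holds, fails = [], []
    for _ in range(n_samples):
        theta = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))
        (holds if conjecture(*theta) else fails).append(theta)
    return holds, fails

n_samples = 10_000
holds, fails = run_experiment(n_samples)
print("fraction where the conjecture holds:", len(holds) / n_samples)

# Learn from the pattern of truth: summarize the failures by the
# fractional parts of x and y.
frac_sums = [(x % 1.0) + (y % 1.0) for x, y in fails]
print("smallest sum of fractional parts among failures:", min(frac_sums))
```

The conjecture is false as a theorem, but the experiment is still informative: it holds about half the time, and the failures all sit where the fractional parts of x and y add up to 1 or more, which points straight at the corrected statement.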

I was moved to think about all this after reading this great post by John Cook demonstrating the use of this idea. Cook’s post is great for two reasons:

1. His example is just on the border of triviality. As noted at the top of this post, “near-trivial” could sound bad, but I’m intending it to be a compliment. Trivial examples are fine too, but this one has enough complexity to be non-trivial while being simple enough that once you see the pattern, it’s kinda clear. It’s similar to our golf example (see section 10 here). Here’s a relevant discussion: Why we kept the trig in golf: Mathematical simplicity is not always the same as conceptual simplicity.

2. He doesn’t just demonstrate simulation-based experimentation, he also goes through its workflow. I illustrate through some excerpts from Cook’s post:

I [Cook] wrote some Python code to test this, and to my great surprise, the identity often holds. In my first experiment, it held something like 84% of the time for random inputs. I expected it would either rarely hold, or hold say half the time (e.g. in a half-plane).

My first thought was to plot 1,000 random points, green dots where the identity holds and red stars where it does not. This produced the following image.

I like how he presents results in the context of his expectations.

Cook continues:

Since I’m sampling uniformly throughout the square, there’s no reason to plot both where the identity holds and where it doesn’t. So for my next plot I just plotted where the identity fails.

The dots on those graphs are too large, but, hey, nobody’s perfect! For some reason lots of people like to make graphs with dots that are too large.

More from Cook:

That’s a little clearer. To make it clearer, I rerun the plot with 10,000 samples. (That’s the total number of random samples tested, about 16% of which are plotted.)

Then to improve the resolution even more, I increased the number of samples to 1,000,000.

It might not be too hard to work out analytically the regions where the identity holds and doesn’t hold. The blue regions above look sorta like parabolas, which makes sense because square roots are involved. And these parabola-like curves have ziggurat-like embellishments, which makes sense because floors are involved.

I like that—the connection between the graph and the theoretical model.

And there’s still more, including a final graph but I’ll let you read the whole thing for that. My point here is we learn not just from the example but also from the workflow, the steps that got us there.

P.S. I scheduled this post for this day because the example is a fun present all wrapped up, also because it seems that Cook is Christian so he might appreciate this gesture. Blogs are so cool. It’s Christmas every day, with strangers offering us amazing and unexpected gifts from all over the world. The internet is more than just people screaming at each other, it’s also people going to the trouble of preparing these delicious treats and sharing them with anyone who cares to stop by.

Estimates of “false positive” rates in various scientific fields

Uli Schimmack writes:

I am curious what you think about our recent attempts to estimate the false discovery risk (maximum rate under assumption of 100% power) based on estimates of a bias-corrected discovery rate? We applied this method to medicine (similar to Jager and Leek, 2014) and psychology (hand-coding). Results are very similar with an FDR estimate between 10 and 20 percent. Based on results we recommend an alpha of .01 to maintain a long-run FDR below 5%.

He points to these two posts:

Most published results in medical journals are not false (with Frantisek Bartos)

Estimating the False Positive Risk in Psychological Science
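For readers who want to see the arithmetic behind a “maximum rate under assumption of 100% power,” here is a minimal sketch in Python of the bound usually attributed to Sorić (1989). I am assuming this is the kind of calculation being referred to; the linked posts are the place to check the details of the actual method, including how the discovery rate is bias-corrected. The discovery rates in the loop are hypothetical inputs, not estimates from any literature.

```python
def max_false_discovery_rate(discovery_rate, alpha=0.05):
    """Upper bound on the false discovery rate, assuming every true
    effect is detected (power = 1).

    If a fraction pi0 of tested hypotheses are true nulls, then
        discovery_rate = alpha * pi0 + (1 - pi0),
    so  pi0 = (1 - discovery_rate) / (1 - alpha),
    and FDR = alpha * pi0 / discovery_rate.
    """
    if not 0.0 < discovery_rate <= 1.0:
        raise ValueError("discovery_rate must be in (0, 1]")
    bound = (1.0 / discovery_rate - 1.0) * alpha / (1.0 - alpha)
    return min(1.0, bound)

for dr in (0.2, 0.3, 0.5, 0.8):   # hypothetical discovery rates
    print(f"discovery rate {dr:.1f}: "
          f"max FDR = {max_false_discovery_rate(dr):.3f}")
```

The bound depends only on the discovery rate and alpha, which is why the estimate of the discovery rate (and the correction for selection that goes into it) carries all the weight.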

My quick reply is I don’t find the true-positive, false-positive framework very helpful. Here’s what Keith O’Rourke and I wrote in our comment on the Jager and Leek paper:

Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. To answer such a question would require additional care in defining what it means for a study to be correct. The research paradigm of effects being either zero or non-zero is not, we believe, particularly helpful, for two reasons. First, we almost always care about the direction of an effect, not merely its existence. Second, the magnitude of any effects or comparisons are also important and, in fact, connect directly to concerns about replicability of scientific phenomena.

Medical researchers are mostly studying real effects (setting aside certain wacky examples and desperate clinical research areas involving high mortalities). But there is a lot of variation. A new treatment will help in some cases and hurt in others. Also, studies are not perfect . . .

Our point about Type 1 errors is not primarily “semantics” or “philosophy”. The framework of the Jager and Leek paper under discussion is admirably clear—our problem is that we do not think it applies well to reality. We have a problem with the identification of scientific hypotheses as statistical “hypotheses” of the “θ = 0” variety. We understand that the authors chose to follow the model used by the much-cited Ioannidis (2005), but that does not recuse them from dealing with the logical difficulties involved with that model.

That said, I recognize that many people do think in these terms, so I’m linking to these posts of Schimmack as they may interest some of you.

“Why and How We Should Join the Shift From Significance Testing to Estimation”

Valentin Amrhein writes:

Daniel Berner and I [Amrhein] have a new manuscript on “Why and how we should join the shift from significance testing to estimation” in ecology and evolutionary biology, written for the audience of applied researchers in those fields. We basically repeat what we all said and wrote elsewhere, and included some simulations on the P-value distribution, power, and effect size inflation / deflation:

Before looking at the simulations, I actually wasn’t aware that the downwards bias in ‘non-significant’ estimates gets stronger with higher statistical power – maybe we could call this the loser’s curse (which might be a winner’s curse as well if people are explicitly looking for non-significant outcomes).

The paper is not as long as it seems because appendices and code are included in the pdf – if anyone has time to briefly skim it and save us from the worst mistakes it would be greatly appreciated.

Erik van Zwet added:

I had a quick look and it all seems very sensible. I agree that if a paper does lots and lots of tests it should be considered exploratory research. In such a setting, however, I do think the p-value can be useful for some quick scanning. Of course the problems happen when significant p-values are viewed as “discoveries” and insignificant ones as “no effect”.

As I mentioned earlier, the winner’s and loser’s curses really don’t have that much to do with significance or non-significance. If “p<0.05” would be banned forever, the “curse” would still be there. It’s just that unconditional properties like unbiasedness and coverage become meaningless or even misleading after the data are in. A frequentist might shrug, and say that there’s simply nothing more to be done once you condition on the data, because there’s no more randomness left. That’s not a valid argument from the Bayesian point of view, but it’s even problematic from the frequentist point of view if you condition only on the event p<0.05. Then there’s still some randomness left, and you can actually see what happens to the bias of your estimator (i.e. the winner’s curse) and the coverage of your confidence interval. Apart from these philosophical musings, I was wondering if you collected all those p-values from those 48 papers. If you did, I’d like to try to estimate the distribution of the actual power.
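The conditioning described in these two emails can be worked out directly. The sketch below is not the simulation from the manuscript; it assumes the simplest possible setup, an estimate that is normal with mean `theta` and standard error 1, declared significant when its absolute value exceeds 1.96, and it uses the truncated-normal mean formula instead of simulating. The values of `theta` are illustrative.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_means(theta, z_crit=1.96):
    """For estimate ~ normal(theta, 1), return
    (power, E[estimate | significant], E[estimate | non-significant]),
    where significant means |estimate| > z_crit."""
    a, b = -z_crit - theta, z_crit - theta   # non-significant region, centered
    p_nonsig = Phi(b) - Phi(a)
    mean_nonsig = theta + (phi(a) - phi(b)) / p_nonsig
    power = 1.0 - p_nonsig
    # Law of total expectation: theta = power*mean_sig + p_nonsig*mean_nonsig
    mean_sig = (theta - p_nonsig * mean_nonsig) / power
    return power, mean_sig, mean_nonsig

for theta in (1.0, 2.0, 2.8):
    power, mean_sig, mean_nonsig = conditional_means(theta)
    print(f"theta = {theta}: power = {power:.2f}, "
          f"bias | significant = {mean_sig - theta:+.2f}, "
          f"bias | non-significant = {mean_nonsig - theta:+.2f}")
```

In this setup the estimates that clear the threshold are too large on average, and the ones that do not are too small, with the downward bias of the non-significant estimates growing as power increases. Averaged over both groups the estimator is unbiased, which is the sense in which the unconditional property stops being informative once you condition on the outcome.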

I asked Valentin and Erik if it would be ok to quote them on the blog and they agreed. Valentin added:

Do you think there would be a way for this blog post to appear faster than with your usual time lag? We submitted the paper last week to the Journal of Evolutionary Biology, and it would be good to get feedback for the revision.

So here it is! You can leave your feedback for them in the comments.

“The obesity wars and the education of a researcher: A personal account”

Asher Meir points us to this article by Katherine Flegal, who writes:

A naïve researcher [Flegal] published a scientific article in a respectable journal. She thought her article was straightforward and defensible. It used only publicly available data, and her findings were consistent with much of the literature on the topic. Her coauthors included two distinguished statisticians. To her surprise her publication was met with unusual attacks from some unexpected sources within the research community. These attacks were by and large not pursued through normal channels of scientific discussion. Her research became the target of an aggressive campaign that included insults, errors, misinformation, social media posts, behind-the-scenes gossip and maneuvers, and complaints to her employer. The goal appeared to be to undermine and discredit her work. The controversy was something deliberately manufactured, and the attacks primarily consisted of repeated assertions of preconceived opinions. She learned first-hand the antagonism that could be provoked by inconvenient scientific findings. . . .

Wow. It’s good to see a researcher be open about this sort of thing. There’s a lot of incentive not to rock the boat or to be known as a complainer: after all, you want to be known for your work, not for being victimized. So I appreciate that Flegal just lays it all out. It’s just her perspective, but that’s fine. She gives details and references.

I was curious if this Flegal et al. (2005) paper had ever come up on our blog, so I googled, and . . . yes:

From 2007: Being Overweight Isn’t All Bad, Study Says

From 2013: Thin scientists say it’s unhealthy to be fat

At no point did I try to evaluate the claims or the scientific debate. That would take work!

But, yeah, thin people are the worst.

The NFL regression puzzle . . . and my discussion of possible solutions:

Alex Tabarrok writes:

Here’s a regression puzzle courtesy of Advanced NFL Stats from a few years ago and pointed to recently by Holden Karnofsky from his interesting new blog, ColdTakes. The nominal issue is how to figure out whether Aaron Rodgers is underpaid or overpaid given data on salaries and expected points added per game. Assume that these are the right stats and correctly calculated. The real issue is which is the best graph to answer this question:

Brian 1: …just look at this super scatterplot I made of all veteran/free-agent QBs. The chart plots Expected Points Added (EPA) per Game versus adjusted salary cap hit. Both measures are averaged over the veteran periods of each player’s contracts. I added an Ordinary Least Squares (OLS) best-fit regression line to illustrate my point (r=0.46, p=0.002).

Rodgers’ production, measured by his career average Expected Points Added (EPA) per game is far higher than the trend line says would be worth his $21M/yr cost. The vertical distance between his new contract numbers, $21M/yr and about 11 EPA/G illustrates the surplus performance the Packers will likely get from Rodgers.

According to this analysis, Rodgers would be worth something like $25M or more per season. If we extend his 11 EPA/G number horizontally to the right, it would intercept the trend line at $25M. He’s literally off the chart.

Brian 2: Brian, you ignorant slut. Aaron Rodgers can’t possibly be worth that much money….I’ve made my own scatterplot and regression. Using the exact same methodology and exact same data, I’ve plotted average adjusted cap hit versus EPA/G. The only difference from your chart above is that I swapped the vertical and horizontal axes. Even the correlation and significance are exactly the same.

As you can see, you idiot, Rodgers’ new contract is about twice as expensive as it should be. The value of an 11 EPA/yr QB should be about $10M.

Alex concludes with a challenge:

Ok, so which is the best graph for answering this question? Show your work. Bonus points: What is the other graph useful for?
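Without taking a side on the challenge, the mechanics behind the two Brians’ disagreement can be reproduced with synthetic data. The sketch below does not use the Advanced NFL Stats data; the salaries and EPA values are simulated from made-up parameters, chosen so that the correlation is moderate. It fits both regressions and then asks each one what salary goes with a high EPA value.

```python
import math
import random

def ols(x, y):
    """Least-squares slope and intercept for regressing y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def correlation(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: made-up parameters, not real quarterbacks.
rng = random.Random(1)
salary = [rng.gauss(10.0, 5.0) for _ in range(200)]      # $M per year
epa = [0.3 * s + rng.gauss(0.0, 3.0) for s in salary]    # EPA per game

r = correlation(salary, epa)
b_epa, a_epa = ols(salary, epa)    # "Brian 1": EPA regressed on salary
b_sal, a_sal = ols(epa, salary)    # "Brian 2": salary regressed on EPA

epa_star = 11.0
# Read the first line sideways: which salary does it pair with epa_star?
salary_from_inverting = (epa_star - a_epa) / b_epa
# Read the second line directly: expected salary given epa_star.
salary_from_direct = a_sal + b_sal * epa_star

print(f"r = {r:.2f}")
print(f"salary from inverting EPA-on-salary line: {salary_from_inverting:.1f}")
print(f"salary from salary-on-EPA line:           {salary_from_direct:.1f}")
print(f"product of slopes = {b_epa * b_sal:.3f}, r^2 = {r * r:.3f}")
```

Both fitted lines pass through the point of means, and the product of the two slopes equals r squared. Unless the correlation is perfect, inverting one regression never reproduces the other, so the same data and the same correlation yield two very different dollar figures for a player far from the average.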

I posted this a few months ago and promised my solution. Here it is:

Wilkinson’s contribution to interactive visualization

This is Jessica. Upon learning this morning that Lee Wilkinson passed away I also felt compelled to write something on the extent to which his work has influenced interactive visualization research. 

The Grammar of Graphics was an incredibly ambitious undertaking – Wilkinson set out to create a system that could produce any statistical graphic he’d ever seen, and that could deepen understanding of the meaning of graphics. The GoG demonstrates the minimum set of components necessary to generate a statistical graphic, under an understanding that a graph is a function: data, algebra, scales, statistics, geometry, coordinates, and aesthetics. I often tell students I teach that in visualization research we hate chart taxonomies, and GoG is perhaps the best demonstration of how much more deeply we can think about visualization. Wilkinson pointed out the “deep structure” in visualization, observing, for instance, that a pie chart is just the result of passing rectangular marks through a polar transformation. He was inspired by Bertin’s work on graphical symbolism, and GoG systematizes thinking about the design space of visualizations in a way that is ultimately generative as well. You might be able to use an interactive implementation of it to make some crazy graphs but nothing that isn’t meaningful. 

From what Wilkinson describes (e.g., in this recent podcast) some of the hard work behind the Grammar of Graphics was in the editing: making sure the system was complete and correct while also minimally complex. There are only three operators in the algebra – cross, nest, and blend – but they suffice. Tableau’s underlying table algebra and ggplot2 are examples of major components of the grammar used in today’s most popular visualization tools. At the same time, the book covers uncertainty, time, graph drawing, interactive control, and just about every other major branch of visualization research in some way, synthesizing important distinctions that would otherwise take someone a while to glean from the literature. 

My own admiration for Grammar of Graphics is partly why I chose to get into visualization back as a grad student. I remember thinking his concept of a frame was really important but underappreciated in any discussions I’d heard about visualization. I read it for the first time as a Ph.D. student and have been calling it my favorite book for years. Whenever I go back to reread chapters I always come away with some new appreciation. I even bring in a copy to pass around in my interactive visualization course, trying to get students to sense its influence and hopefully read it. Just looking at the examples is like an education in visualization.

I didn’t know Lee well at all, but recall meeting him for the first time at the IEEE VIS doctoral colloquium, back in 2012. I remember he came in very late, but just in time for my presentation on uncertainty visualization, and his enthusiasm for the ideas (basically hypothetical outcome plots) was the highlight of my week. A few years later I remember talking to him at Tableau about the same ideas, and that time he was more critical, arguing that they would never catch on, but I appreciated that too. Lee has been critical of the visualization research community at times over the years, and while it was sometimes tough to hear, it was always clear he cared deeply and was optimistic about the field’s progress and open-minded intellectual attitude. His perspective will be missed.

Eliciting expert knowledge in applied machine learning

This is Jessica. I was going to blog about elicitation a few days ago, and then before I got to publishing it Aki brought up elicitation of priors for Bayesian analysis. Elicitation is a topic I started thinking about a few years ago with Yea Seul Kim, where we were focusing mostly on graphical elicitation of prior beliefs for visualization interaction or analysis. The new paper by Aki and others is a great summary of the many hard questions one can run into in eliciting knowledge. As Tony O’Hagan, who’s written extensively on elicitation, has said, “Eliciting expert knowledge carefully, and as scientifically as possible, is not simply a matter of sitting down with one or more experts and asking them to tell us what they think.” But while we know the elicitation process matters, it can be very difficult to evaluate whether you’ve gotten the “right” beliefs from someone. There’s undoubtedly some effect of asking them in the first place, and a danger that the elicitation process hallucinates an unwarranted amount of detail or bias in representing their knowledge. For me it’s been the kind of topic where the more I work on it, the less confident I feel that it is working.

Anyway, prior elicitation is just one relatively well studied form of elicitation. Dan Kerrigan, Enrico Bertini and I recently looked at a sample of applied machine learning papers whose modeling contributions involve integrating knowledge gained from domain experts. There’s some precedent for looking at elicitation in what are called “expert systems,” which tend to be associated with developing knowledge bases and rule-based approaches to mimic expert decision making, popular in the 80s and into the 90s. We consider knowledge elicitation broadly in the context of fully automated and mixed initiative or human-in-the-loop predictive models, where there’s often some acknowledgment of the need to work with domain experts to make sure the models solve the right problem and gain intrinsic trust in the sense of aligning with the experts’ causal models, etc.

Eliciting knowledge from domain experts can play an important role throughout the machine learning process, from correctly specifying the task to evaluating model results. However, knowledge elicitation is also fraught with challenges. In this work, we consider why and how machine learning researchers elicit knowledge from experts in the model development process. We develop a taxonomy to characterize elicitation approaches according to the elicitation goal, elicitation target, elicitation process, and use of elicited knowledge. We analyze the elicitation trends observed in 28 papers with this taxonomy and identify opportunities for adding rigor to these elicitation approaches. We suggest future directions for research in elicitation for machine learning by highlighting avenues for further exploration and drawing on what we can learn from elicitation research in other fields.

We looked at a set of papers (28) that were published in ML or related areas in the last 25 years and specifically mentioned domain knowledge elicitation. In getting to these we sifted through a much larger set of ML related papers that suggested some integration of domain knowledge, but ruled out at least as many as we report on because they didn’t motivate and describe the elicitation of expert knowledge or they didn’t demonstrate the use of the elicited knowledge in a model or system. (PS If you have paper suggestions please post them in the comments, we’re interested in growing the list for future reference).   

Domain knowledge can play many roles in an ML pipeline, so among those papers that did describe elicitation in some detail, we differentiate the high level goal for each described form of elicitation: problem definition (figuring out the task, how humans currently do it, what the relevant data sources are, how to evaluate performance), feature engineering (feature relevance, transformations, etc.), model development (eliciting rules, constraints, or feature relationships), and model evaluation (not very prevalent in the sample, but covering cases where domain expertise was used to assess the model’s performance and validate its results to improve the model).

We also break down elicitation processes into the target of elicitation (e.g., instances, feature relevance, model constraints), process details like the medium (e.g., meetings or interviews, computer app), whether the elicitation prompt is well-defined (e.g., are there clear questions or tasks that are described as being put to the expert), whether any context given to the expert to establish common ground is described (which might include describing elicitation goals, model details, notable instances, or other information to “grease the wheel”), how well structured the prompts are, how constrained the experts’ responses are, if potentially unreliable or erroneous responses were accounted for in modeling (e.g., were expert labels treated as probabilistic?), and whether or not extracted knowledge was validated (checked for consistency or reliability in some way, even if informally like by recapitulating to the expert what one learned from an interview; notably this was absent in 75% of the papers we looked at). We look at whether there’s any pre-processing required before integrating into model development, and if so, whether it’s described, and how well defined the use of the information is in the model development pipeline. Was there a predefined and unambiguous process for applying the expert knowledge in the model development pipeline? Examples where the use of the elicited knowledge is well defined include eliciting priors on edges for a Bayesian network, incorporating a monotonicity constraint to capture the expert’s mental model of how a feature behaves, or asking experts to label certain instances for training the model. An example where it’s not well defined would be reporting on holding workshops with domain experts like doctors to understand what they found challenging about certain diagnoses but then not saying how exactly the information that was gained informed the model.

Here’s a diagram showing how things break down in terms of goals, different types of process details, and use of the elicited info. We have to recall that these are the papers that provided enough detail to code in the first place, but overall it seems there’s a fair number of cases where the medium used to collect information isn’t specified, and tendencies to overlook how the expert is guided to think about the goals or model details as they provide information (context), to treat the elicited knowledge as given rather than formally account for uncertainty (unreliable responses not accounted for), and to not worry about validating responses.

[Figure: Sankey diagram showing patterns in elicitation processes in applied ML papers]

Maybe the informality of elicitation processes that these results suggest (at least when we’re not talking about more conventional uses we saw like eliciting labels or feature relevance) isn’t that surprising, since ML has traditionally de-emphasized the human tweaking and interfacing parts. It’s only relatively recently in the history of AI/ML that ideas like intelligence augmentation, human-in-the-loop, etc. have become mainstream. Maybe the best of the old expert systems elicitation approaches should be dusted off and covered more in ML curricula.

If a bigger analysis were to find similar patterns, how important is it to increase the emphasis on more rigorous elicitation in ML? We argue that more rigor (and more development and validation of elicitation focused methods or tools) would be a worthy investment, for reasons of efficiency and transparency. If you’re not doing any validation or thinking carefully about prompting, establishment of common ground, etc., it seems easy to end up with some valid elicited knowledge and some hallucinated bits. And whenever the integration of domain knowledge is used as a selling point for the research contribution, it’s fair to expect a reproducible process that defines the target of elicitation, how the knowledge will be used, and how responses will be evaluated, rather than figuring it all out as we go.

DeclareDesign

Political scientists Graeme Blair, Jasper Cooper, Alexander Coppock, and Macartan Humphreys write:

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. DeclareDesign is based on the Model-Inquiry-Data Strategy-Answer Strategy (MIDA) framework for describing designs and a declare-diagnose-redesign workflow for improving research designs before implementing them.

I haven’t used DeclareDesign myself, but I imagine it could be very helpful for researchers. Here’s an example with the 8 schools.
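To give a flavor of the declare-diagnose idea without the package, here is a minimal Python sketch. It is not DeclareDesign's API (which is in R); the function names, the two-arm design, and all numbers are made up for illustration. It declares a toy design, simulates it many times, and reports bias, power, and how much the statistically significant estimates exaggerate the true effect.

```python
import random
import statistics

def run_study(n, true_effect, rng):
    """Simulate one two-arm experiment (unit-variance noise) and analyze it."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    # Answer strategy: difference in means with its standard error.
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    return estimate, se

def diagnose(n, true_effect, sims=2000, seed=1):
    """Repeat the study many times and summarize the design's properties."""
    rng = random.Random(seed)
    results = [run_study(n, true_effect, rng) for _ in range(sims)]
    estimates = [est for est, _ in results]
    significant = [est for est, se in results if abs(est) > 1.96 * se]
    exaggeration = float("nan")
    if significant:
        # Average magnitude of "significant" estimates relative to the truth.
        exaggeration = statistics.fmean(abs(e) for e in significant) / true_effect
    return {
        "bias": statistics.fmean(estimates) - true_effect,
        "power": len(significant) / sims,
        "exaggeration": exaggeration,
    }

# Hypothetical designs: same assumed effect, two different sample sizes per arm.
small = diagnose(n=50, true_effect=0.2)
large = diagnose(n=400, true_effect=0.2)
print("n=50: ", small)
print("n=400:", large)
```

The "redesign" step is then just changing `n` (or the estimator) and diagnosing again before collecting any real data.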

When confidence intervals include unreasonable values . . . When confidence intervals include only unreasonable values . . .

Robert Kaestner writes:

Economists’ love affair with randomized controlled trials (RCTs) is growing stronger by the day.

But what should we make of an RCT that produces a point estimate and confidence interval that largely includes values that most would consider implausible?

The Goldin et al. article on effects of health insurance on mortality (QJE) provides a good example. Point estimate suggests that 6 months of extra insurance coverage reduces mortality by 100%. Most of the confidence interval includes possible estimates that are also implausible.

Of course the confidence interval is wide and includes smaller, perhaps plausible values.

My first response is this has nothing to do with randomized clinical trials; it’s just a general issue about confidence intervals, which Sander Greenland refers to as “compatibility intervals” because, when they work like they’re supposed to, they give an interval of parameter values that are compatible with the data.

But there are problems with this idea.

The first problem, as noted above, is that confidence intervals can contain lots of values that may be compatible with the data but which make no sense, i.e., they’re not compatible with our prior information. But that’s why we use Bayesian methods. Throw in your prior and these outlandish estimates should no longer be in the interval, unless the data and model really really support them.

The second problem is that confidence intervals can exclude reasonable values that are compatible with the data. We saw this with the notorious beauty-and-sex-ratio example, where the 95% confidence interval was entirely composed of unreasonable parameter values. The trouble here is that you can get unlucky, or you can sift through your data to get unlucky on purpose, as it were, by finding confidence intervals that just happen to only include unreasonable values.

Ultimately the problem is that we have uncertainty, and it’s a mistake to take an estimate—even an interval estimate—as a statement of certainty.
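Here is a small Python sketch of what "throw in your prior" does in the simplest case, a normal prior combined with a normal estimate. The numbers are hypothetical (they are not taken from any of the studies mentioned), and the prior scale is an assumption you would have to defend in a real analysis.

```python
def posterior_normal(estimate, se, prior_mean, prior_sd):
    """Combine a normal prior with a normal likelihood by precision weighting."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, post_var**0.5

# Hypothetical numbers: a noisy estimate of 8 (standard error 3), on a scale
# where prior knowledge says true effects are very unlikely to exceed 1 in size.
estimate, se = 8.0, 3.0
classical = (estimate - 1.96 * se, estimate + 1.96 * se)

post_mean, post_sd = posterior_normal(estimate, se, prior_mean=0.0, prior_sd=0.5)
bayes = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)

print("classical 95% interval:", classical)
print("posterior 95% interval:", bayes)
```

With these made-up inputs the classical interval contains only values the prior considers unreasonable, while the posterior interval is pulled most of the way back toward zero. That is the data being too noisy to overturn what was already known, not the prior "winning" by fiat.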

P.S. Just to clarify: It’s my impression that the usual way to work with confidence intervals is to take the 95% interval and act as if it contains the true value, with the understanding that you’ll be wrong 5% of the time. My criticism of this approach is the usual Bayesian criticism, that sometimes you know the interval entirely contains bad values, you know the true value is in the interval with something like 0% probability. What are you supposed to do then—just sit there and take it? That way lies Zillow madness.

Importance of understanding variation when considering how a treatment effect will scale

Art Owen writes:

I saw the essay, “Nothing Scales,” by Jason Kerwin, which might be a good topic for one of your blog posts. Maybe a bunch of other people sent it to you already.

He seems to think we just need more and better data and methods to get things to generalize/scale. It’s not clear to me that we’ll get enormously better data per subject on education or behavior. Maybe we will get better sets of subjects (more coverage) in a more complex and expensive study.

The post in question is by an economist who is emphasizing the importance of varying treatment effects. This is something that people have been talking about for a while, but, as with many things in statistics, this is something that we each have to rediscover on our own. It’s said that the best way to learn something is to teach it, so it’s good to see Kerwin’s discussion, which he develops in the context of a real example. And I appreciate that he refers to 16.

Just a couple minor things. Kerwin writes:

Treatment effect heterogeneity also helps explain why the development literature is littered with failed attempts to scale interventions up or run them in different contexts. Growth mindset did nothing when scaled up in Argentina. Running the “Jamaican Model” of home visits to promote child development at large scale yields far smaller effects than the original study. The list goes on and on; to a first approximation, nothing we try in development scales.

Why not? Scaling up a program requires running it on new people who may have different treatment effects. And the finding, again and again, is that this is really hard to do well. . . .

I’m with him on the importance of varying treatment effects, but, when it comes to explaining why estimated effects don’t replicate at their published magnitudes, I think he’s missing the big point that published estimates tend to be overestimates because of the winner’s curse (selection bias); see for example here. Also he writes, “None of the techniques we use to look at treatment effect variation currently work for non-experimental causal inference techniques.” That’s not true at all! Plain old regression with interactions works just fine, or you can break out the nonparametrics as with Hill (2011).
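To make the "regression with interactions" point concrete, here is a minimal Python sketch on simulated observational data. Everything in it is an assumption for illustration: one binary covariate x that drives both treatment uptake and the size of the effect, and no unmeasured confounding. With binary z and x, the least-squares fit of y ~ z + x + z:x reproduces the four cell means, so the coefficients can be read off directly without a regression library.

```python
import random
import statistics

rng = random.Random(0)
rows = []
for _ in range(4000):
    x = rng.random() < 0.5                   # binary pre-treatment covariate
    z = rng.random() < (0.7 if x else 0.3)   # non-random assignment: depends on x
    effect = 0.5 if x else 0.1               # the treatment effect varies with x
    y = 1.0 + 2.0 * x + effect * z + rng.gauss(0.0, 1.0)
    rows.append((x, z, y))

def cell_mean(x, z):
    """Average outcome among units with the given covariate and treatment values."""
    return statistics.fmean(y for xi, zi, y in rows if xi == x and zi == z)

# Coefficients of the saturated model y ~ z + x + z:x, written via cell means.
b_z = cell_mean(False, True) - cell_mean(False, False)           # effect when x = 0
b_zx = (cell_mean(True, True) - cell_mean(True, False)) - b_z    # change in effect when x = 1

# Raw treated-minus-control comparison, which ignores x entirely.
naive = (statistics.fmean(y for _, z, y in rows if z)
         - statistics.fmean(y for _, z, y in rows if not z))

print("effect at x=0:", round(b_z, 2))
print("interaction:  ", round(b_zx, 2))
print("naive diff:   ", round(naive, 2))
```

The interaction coefficient recovers the variation in the effect across x, while the naive comparison is badly off because x also predicts who gets treated. With continuous or many covariates you would fit the same kind of model by least squares or, as noted above, something nonparametric.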

Again, I like Kerwin’s main point, which is that when considering how a treatment will scale in the real world, it’s important to think about treatment effect variation, not just as a mathematical concept (correcting for “heteroscedasticity” or whatever) but substantively. I also agree with what Caroline Fiennes writes in comments, that it’s important to know what is the cost of an intervention and what exactly the intervention is.

“Have we been thinking about the pandemic wrong? The effect of population structure on transmission”

Philippe Lemoine writes:

I [Lemoine] just published a blog post in which I explore what impact population structure might have on the transmission of an infectious disease such as COVID-19, which I thought might be of interest to you and your readers. It’s admittedly speculative, but I like to think it’s the kind of speculation that might be fruitful. Perhaps of particular interest to you is my discussion of how, if the population has the sort of structure my simulations assume, it would bias the estimates of causal effects of interventions. This illustrates a point I made before, such as in my discussion of Chernozhukov et al. (2021), namely that any study that purports to estimate the causal effect of interventions must — implicitly or explicitly — assume a model of the transmission process, which makes this tricky because I don’t think we understand it very well. My hope is that it will encourage more discussion of the effect population structure might have on transmission, a topic which I think has been under-explored, although other people have mentioned the sort of possibility I explore in my post before. I’m copying the summary of the post below.

– Standard epidemiological models predict that, in the absence of behavioral changes, the epidemic should continue to grow until herd immunity has been reached and the dynamic of the epidemic is determined by people’s behavior.
– However, during the COVID-19 pandemic, there have been plenty of cases where the effective reproduction number of the pandemic underwent large fluctuations that, as far as we can tell, can’t be explained by behavioral changes.
– While everybody admits that other factors, such as meteorological variables, can also affect transmission, it doesn’t look as though they can explain the large fluctuations of the effective reproduction number that often took place in the absence of any behavioral changes.
– I argue that, while standard epidemiological models, which assume a homogeneous or quasi-homogeneous mixing population, can’t make sense of those fluctuations, they can be explained by population structure.
– I show with simulations that, if the population can be divided into networks of quasi-homogeneous mixing populations that are internally well-connected but only loosely connected to each other, the effective reproduction number can undergo large fluctuations even in the absence of behavioral changes.
– I argue that, while there is no evidence that can bear directly on this hypothesis, it could explain several phenomena beyond the cyclical nature of the pandemic and the disconnect between transmission and behavior (why the transmission advantage of variants is so variable, why waves are correlated across regions, why even places with a high prevalence of immunity can experience large waves) that are difficult to explain within the traditional modeling framework.
– If the population has that kind of structure, then some of the quantities we have been obsessing over during the pandemic, such as the effective reproduction number and the herd immunity threshold, are essentially meaningless at the aggregate level.
– Moreover, in the presence of complex population structure, the methods that have been used to estimate the impact of non-pharmaceutical interventions are totally unreliable. Thus, even if this hypothesis turned out to be false, we should regard many widespread claims about the pandemic with the utmost suspicion since we have good reasons to think it might be true.
– I conclude that we should try to find data about the characteristics of the networks on which the virus is spreading and make sure that we have such data when the next pandemic hits so that modeling can properly take population structure into account.
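To see how population structure alone can produce waves, here is a toy Python sketch. It is not Lemoine's simulation: it is a deterministic daily-step SIR model on a chain of three groups, with made-up parameters and a made-up coupling strength `eps`, and behavior (the contact rate `beta`) is held fixed throughout.

```python
def simulate(n_groups=3, group_size=10_000, beta=0.3, gamma=0.1, eps=1e-5, days=250):
    """Deterministic daily-step SIR on a chain of groups; eps scales between-group contact."""
    s = [float(group_size)] * n_groups
    i = [0.0] * n_groups
    s[0] -= 10.0
    i[0] = 10.0                                   # seed the first group only
    total_infected, r_eff = [], []
    for _ in range(days):
        new = []
        for g in range(n_groups):
            neighbors = [i[h] for h in (g - 1, g + 1) if 0 <= h < n_groups]
            force = beta * (i[g] + eps * sum(neighbors)) / group_size
            new.append(s[g] * force)
        # Aggregate reproduction number proxy: new infections per recovery.
        r_eff.append(sum(new) / (gamma * sum(i)))
        for g in range(n_groups):
            s[g] -= new[g]
            i[g] += new[g] - gamma * i[g]
        total_infected.append(sum(i))
    return total_infected, r_eff

def count_waves(series, floor=1000.0):
    """Count local maxima of the series that rise above a floor."""
    return sum(
        1 for t in range(1, len(series) - 1)
        if series[t - 1] < series[t] >= series[t + 1] and series[t] > floor
    )

structured_infected, structured_r = simulate()
mixed_infected, mixed_r = simulate(n_groups=1, group_size=30_000)

print("waves, loosely connected groups:", count_waves(structured_infected))
print("waves, one well-mixed population:", count_waves(mixed_infected))
```

In the well-mixed run the aggregate reproduction number falls below 1 once and stays there. In the structured run it drops below 1 as the first group burns out and then climbs back above 1 as the epidemic reaches the next group, even though nobody changed their behavior. Real networks are far messier than a chain of three groups, so treat this as an illustration of the mechanism and nothing more.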

I agree with Lemoine that we don’t understand well what is going on with covid, or with epidemics more generally. As many people have recognized, there are several difficulties here, including data problems (most notably, not knowing who has covid or even the rates of exposure etc. among different groups); gaps in our scientific understanding regarding modes of transmission, mutations, etc.; and, as Trisha Greenhalgh has discussed, a lack of integration of data analysis with substantive theory.

All these are concerns, even without getting to the problems of overconfident public health authorities, turf-protecting academic or quasi-academic organizations, ignorant-but-well-connected pundits, idiotic government officials, covid deniers, and trolls. It’s easy to focus on all the bad guys out there, but even in a world where people are acting with intelligence, common sense, and good faith, we’d have big gaps in our understanding.

Lemoine makes the point that the spread of coronavirus along the social network represents another important area of uncertainty in our understanding. That makes sense, and I like that he approaches this problem using simulation. The one thing I don’t really buy—but maybe it doesn’t matter for his simulation—is Lemoine’s statement that fluctuations in the epidemic’s spread “as far as we can tell, can’t be explained by behavioral changes.” I mean, sure, we can’t tell, but behaviors change a lot, and it seems clear that even small changes in behavior can have big effects in transmission. The reason this might not matter so much in the modeling is that it can be hard to distinguish between a person changing his or her behavior over time, or a correlation of different people’s behaviors with their positions in the transmission network. Either way, you have variation in behavior and susceptibility that is interacting with the spread of the disease.

In his post, Lemoine gives several examples of countries and states where the recorded number of infections went up for no apparent reason, or where you might expect it to have increased exponentially but it didn’t. One way to think about this is to suppose the epidemic is moving through different parts of the network and reaching pockets where it will travel faster or slower. As noted above, this could be explained by some mixture of variation across people and variation over time (that is, changing behaviors). It makes sense that we shouldn’t try to explain this behavior using the crude categories of exponential growth and herd immunity. I’m not sure where this leads us going forward, but in any case I like this approach of looking carefully at data, not just to fit models but to uncover anomalies that aren’t explained by existing models.

Design of Surveys in a Non-Probability Sampling World (my talk this Wed in virtual Sweden)

My talk at this conference in honor of Lars Lyberg on 1 Dec:

There are two steps of sampling: design and analysis. Analysis should respect design (for example, accounting for stratification and clustering) and design should anticipate analysis (for example, collecting relevant background variables to be used in nonresponse adjustment). In recent decades, many techniques have been developed for inference from non-probability samples. We discuss what the existence of these methods implies for design and data collection. What is the role of probability sampling in this world?

P.S. Here’s the youtube of it. I can’t stand seeing myself on video, and I find non-live talks difficult to watch in any case. But here it is in case you want it.

A little correlation puzzle, leading to a discussion of the connections between intuition and brute force

Shane Frederick wrote:

One of my favorite questions to ask people is: r(a,b) = x; r(b,c) = x; r(a,c) = 0. How big could x be?

The answer feels unintuitive to me. And it is unintuitive to almost all, I’ll add.

I took this as a challenge. It should feel intuitive to me, dammit! I’ll let you think about it too before going on.
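If you would rather go the brute-force route first (spoiler ahead), here is a short Python sketch. The only fact it uses is that a set of pairwise correlations is achievable exactly when the correlation matrix is positive semidefinite, which for a 3x3 symmetric matrix can be checked through its principal minors. The function name and grid resolution are my own choices.

```python
def is_valid(x, tol=1e-12):
    """Is [[1, x, 0], [x, 1, x], [0, x, 1]] positive semidefinite?"""
    minor_ab = 1 - x * x      # 2x2 minor for (a, b); the (b, c) minor is identical
    det = 1 - 2 * x * x       # 3x3 determinant, expanded by hand
    return minor_ab >= -tol and det >= -tol

# Scan a fine grid of candidate values of x and keep the largest achievable one.
grid = [k / 10_000 for k in range(10_001)]
largest = max(x for x in grid if is_valid(x))
print("largest achievable x on the grid:", largest)
```

The scan stops at about 0.707, that is 1/√2. One construction that attains it: take a and c independent with unit variance and set b = (a + c)/√2.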

“What, are you nuts? We don’t have time in AP Stats to explain to students what stats actually means. We have to just get them to grind through the computations.”

Andrew Vickers (see here and here) had this fun story about the book he wrote a few years ago, “What is a p-value anyway?” He writes:

Early on, my editor was browsing listservs and came across an AP Stats teachers group that was raving about the book (“p.43 really helped me understand confidence intervals etc etc”). So he wrote to a couple of the teachers saying, “Glad you like the book, would you assign it for your class?”, in an attempt to sell books. They wrote back and said, “What, are you nuts? We don’t have time in AP Stats to explain to students what stats actually means. We have to just get them to grind through the computations.”

Ouch!

P.S. My blurb on the Vickers book: “It’s friendly, accessible, and readable. I like it a lot.”

Scabies!

When talking about junk science, or bad research, or fraud, or mixtures of these things (recall Clarke’s Law), we often talk about the role of scientific journals in promoting bad work (with Psychological Science and PNAS being notorious examples), being defensive and slow to admit problems (Lancet, more than once), playing the business-as-usual game (lots of examples), and flat-out refusing to issue corrections even when pointed out to them (lots more examples).

Another problem is propaganda journals. I’m not talking now about medical and public health journals which occasionally play a propagandist role by not looking hard at papers that push a liberal political agenda, nor am I talking about traditional propaganda such as CIA-funded journals or commercial propaganda such as the apparently all-too-common practice of research claims being dictated by pharmaceutical companies, etc. Nor am I talking about so-called predatory journals that exist not to push an agenda, scientific or otherwise, but just to make money by conning authors into paying for publication and conning promotion committees into counting those publications. Rather, here I’m talking about entire journals created to push some pseudoscience, with today’s example being vaccine denial.

My first thought when seeing an entire journal devoted to a fake science was annoyance, but that initial reaction is really missing the point. Yes, it’s annoying that there are people out there pushing ESP, ghosts, climate change denial, vaccine denial, evolution denial, etc.—and that’s not even getting into the various noxious forms of historical denial—but I guess the real problem here is not the existence of the journals so much as that there are enough people who are confused—passionately confused—that they go to the trouble of putting together these journals in the first place. Conditional on such people existing, yeah, sure, they should definitely set up journals. It’s a free country! Also, apparently misguided theories sometimes do contain truths, so maybe these journals play a potentially valuable role as safe spaces where true believers can share their theories and maybe turn up something useful for the rest of us, if only by accident.

Anyway, my main point here is not whether these journals should exist, or how many such journals there should be, or whether a journal on a fake science like ESP is better or worse than a journal on some popular but unverifiable religious belief, or whether I’m violating the spirit of St. Feyerabend and being “patronizing” and “punching down” by even suggesting that there are people out there who have M.D.’s or Ph.D.’s after their name but don’t know what they’re doing . . . whatever.

No, my main point is that often in our discussions of published research incompetence or misconduct (again, recall Clarke’s Law), we hope or demand or expect or wish that the journal that published the bad thing will remedy the problem. But when it’s junk science published in a junk journal, there’s no hope! Pretty much the entire reason for these journals is to push an agenda and to provide a place for people who push that agenda to publish their papers, so of course that’s what they do. To expect a journal of fake science to retract a paper because it does poor science would be like . . . oh, I dunno, it would be like the House of Lords expelling some Lord Thistlethwaite type for being too snobby.

How much does this bother me? It depends on the field. Arguably, even the junk science on astrology or ghosts is doing some damage, at least to the extent that it degrades the reputation of science more generally and takes resources away from more worthy projects such as Game of Thrones. Junk science such as the critical positivity ratio or himmicanes is a bit worse, as these are shiny objects that attract not just feature stories but also can fool respected science writers. I’d give a break to cold fusion and speculative cancer cures, at least at first, because they fall in the “big if true” category.

Then there’s vaccine denial, which seems much worse to me, as it’s killed hundreds of thousands of people already. I can’t quite say that the researchers who publish vaccine denial papers are immoral, exactly, as many of them might be sincere in their beliefs—they can’t all be political hacks or irresponsible media hounds, and statistics is hard. But, as we all know, bad deeds can be done by people who don’t understand what they’re doing.

I thought about all this after reading this Retraction Watch article about a university lecturer in New Zealand who published a fatally-flawed paper claiming a negative effect of vaccines in a journal published by a vaccine denial group (who, for better or worse, don’t have access to the same high-quality web design as the Hoover-adjacent Panda organization). What was interesting here is that the lecturer’s employer got involved:

Robert Scragg, the head of the School of Population Health at the University of Auckland, where Thornley is employed, took the unusual step of demanding the retraction of the Thornley and Brock paper.

In an email to the institution, which was posted on Twitter, Scragg wrote that the article — in a “low ranking non-indexed journal” — includes a “major error” and called on them to:

immediately publicly retract their article because of the anxiety it is creating for expectant parents and those planning to have a child.

The authors took the hint and retracted their paper.

It’s hard to know what to think about this. On one hand, I don’t like the idea of research being policed by one’s employer. On the other hand, the author teaches in the epidemiology department, and it’s pretty ridiculous to have an epidemiologist pushing anti-vaccine propaganda (or pushing incompetent anti-vaccine work). Everybody makes mistakes, but mistakes that push pseudoscience talking points that are killing people, that’s really bad. Then again, academic freedom. Then again, the head of the school has academic freedom too . . .

Here’s the researcher’s posted self-description:

Avid reader, cyclist, teaches applied statistics, uses R . . . hey, he’s practically talking about me! On the differences side, I’m not much of a photographer and I eat tons of carbs and sugar.

Googling this guy led to this news article from 31 August 2020 where he was quoted as saying:

“Looking at the science, I believe an effective vaccine is a very remote possibility for COVID-19,” said Dr Thornley.

“We know that the world record in terms of vaccine development is four years – that’s with mumps, from the Merck company. We know that most of them take 10 years – they need to be carefully evaluated. These early vaccines that are coming out of Russia I’m very sceptical they’ve been really well tested in long-term studies.”

He said discussions with vaccinologists he knows have led him to be sceptical.

“Hanging out for a vaccine is not an option… a fantasy, in my view.”

I guess the next step after saying the vaccine won’t happen is to deny the vaccine’s effectiveness and to make up stories about its hazards. Kind of funny that he coordinates a course on Evidence Based Practice. Maybe the Hoover Institution could hire him to head up a new biostatistics department?

Drawing Maps of Model Space with Modular Stan

This post is by Ryan Bernstein. I’m a PhD candidate at Columbia with Andrew and Jeannette Wing.

Right now, each probabilistic program we write represents a single model. As we tackle a modeling task, we try one model after another, and old programs pile up in our “models” folder or git history. What if instead of describing one model at a time, we focused on drawing a map of our model space? What if a program represented not a single probabilistic model, but a whole network of models and their rich interconnections, a region of model space that grew as we explored?

If we drew maps of model space as we experimented, it would be easier for others to follow our work. We could annotate regions of our map, like “These models rely on a big assumption,” or “Here, there be dragons convergence issues.” We could draw our path of exploration, from our first experiment to our final model and all of the false turns along the way. We could document, “This way was too slow”, or “This way gave strange prior predictions”, or “I didn’t have time to explore this way, maybe you should.” Issues like the Garden of Forking Paths and P-hacking make it clear that how and why a model was grown is necessary context for assessing its trustworthiness; model space maps would act as living documentation of that context.

Having a map would also help us navigate through model space as we explore. We could orient ourselves in model space by contrasting our current model to its immediate neighbors: we could say things like, in that direction, a posterior mean trends upwards, or calibration issues appear, or predictive accuracy improves.

Once we sketch out a region of model space, we could deploy robot servants to help us understand it. They could scout ahead for models that are accurate, simple, or robust. They could also check that our conclusions don’t depend too much on any one turn we made along our route, à la the Garden of Forking Paths. Maps enable automation, and automation can save us time and catch our errors.

So what kind of program can represent a growing network of models while remaining concise and readable?

I’ve developed a simple metaprogramming feature called ‘swappable modules’1 that can extend languages like Stan to do just that. Each ‘modular Stan program’ can compile down to a whole set of normal, ‘concrete’ Stan programs. For now we’ll treat model space as a discrete2 network of models and we’ll assume that the data is consistent across the network.

First I’ll introduce ‘modular Stan’ alongside a trivial example, and then I’ll give two more interesting case studies. You can play along at home with my prototype compiler and interactive visualizations at http://ryanbe.me/modular-stan.html. You can load each example with the buttons on the lower left, or you can write your own modular program. Expect plenty of bugs!

A way to program maps

Suppose we have some data x that we believe to be normally distributed. Let’s write a modular Stan program to represent a region of the model space of x.

Modular Stan programs start with a base, and then they are built up by adding one piece, or module, at a time. The base is part of every model we produce, so it should only contain things we’re sure won’t change: the shape of the data, the key inferred quantities, and a basic outline of the model. Here is the base for our `x` modular program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(Mean(), Stddev());
}

So far this is like a Stan program except that it has two “holes” in it, Mean and Stddev, that represent as-yet undefined behavior.

Next, we write a module to fill in each hole. A module can contain arbitrary code and declare new parameters like a miniature Stan program, and it can also return a value.

Here’s a module to fill in the Mean hole:

module "standard" Mean() {
    return 0;
}

This reads as: define a module called "standard" that fills the Mean hole with 0.

Here’s a module to fill Stddev:

module "standard" Stddev() {
    return 1;
}

Our modular Stan program up to this point is equivalent to a single concrete Stan program:

data {
    int N;
    vector[N] x;
}
model {
    x ~ normal(0, 1);
}

This isn’t very interesting, so let’s explore the model space a little bit more. We can explore by simply adding more modules.

Let’s try modeling the mean of x as a random variable. We add a module called "normal" to fill the Mean hole with a new parameter that has a prior:

module "normal" Mean() {
    parameters {
        real mu;
    }
    mu ~ normal(0, 1);
    return mu;
}

If we use "normal" to fill Mean and "standard" to fill Stddev, the concrete Stan program will be:

data {
    int N;
    vector[N] x;
}
parameters {
    real mu;
}
model {
    mu ~ normal(0, 1);
    x ~ normal(mu, 1);
}

Since we now have two ways to fill the holes, our modular Stan program is equivalent to a set of two concrete Stan programs.

Now suppose we also want to model the standard deviation, but we aren’t sure whether or not to use an informative prior (sorry for the contrived example). We can leave that decision as another hole:

module "lognormal" Stddev() {
    parameters {
        real<lower=0> sigma;
    }
    sigma ~ lognormal(0, StddevInformative());
    return sigma;
}
module "yes" StddevInformative() {
    return 1;
}
module "no" StddevInformative() {
    return 100;
}

StddevInformative is a hole nested inside the module "lognormal". In this way, modular programs are hierarchical as well as branching. We can view modular programs as trees[3]:

Our modular program now represents a set of six concrete models, one for each way the set of holes can be filled: two ways for Mean ("standard" and "normal") times three ways for Stddev ("standard", "lognormal" with StddevInformative: "yes", and "lognormal" with StddevInformative: "no").
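To check that count, here is a rough Python sketch (not part of the prototype compiler) that enumerates every way to fill the holes. The `MODULES` dictionary and the `fill` helper are my own hypothetical encoding of the “x” example, and the sketch assumes the modular program forms a tree, with no hole shared between modules.

```python
from itertools import product

# Hypothetical encoding of the "x" example: each hole maps to its modules,
# and each module lists the holes nested inside it.
MODULES = {
    "Mean": {"standard": [], "normal": []},
    "Stddev": {"standard": [], "lognormal": ["StddevInformative"]},
    "StddevInformative": {"yes": [], "no": []},
}
BASE_HOLES = ["Mean", "Stddev"]  # the holes that appear in the base


def fill(holes):
    """Return every way to fill the given holes, as a list of {hole: module} dicts."""
    per_hole = []
    for hole in holes:
        options = []
        for module, nested in MODULES[hole].items():
            # Choosing a module also means filling the holes nested inside it.
            for sub in fill(nested):
                options.append({hole: module, **sub})
        per_hole.append(options)
    # Pick one option per hole; sibling holes are filled independently.
    combined = []
    for combo in product(*per_hole):
        merged = {}
        for part in combo:
            merged.update(part)
        combined.append(merged)
    return combined


models = fill(BASE_HOLES)
print(len(models))
for m in models:
    print(m)
```

Each dictionary in `models` corresponds to one concrete Stan program; StddevInformative only appears in the selections that use "lognormal".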

How do we build a network from this set of programs? There’s a natural definition: we draw an edge between two models if they differ by only one module[4]. Each neighbor is then one ‘decision’ away[5]. Here is the network of models for this trivial example:

A representation of the network of models for the “x” example. Edges are annotated with the “hole” that differs between the two models. The prototype website has an interactive version of this visualization.
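The edge rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the nested `selection` encoding, the model labels, and the `differing_holes` helper are hypothetical, and a swapped module is counted as one difference together with everything nested inside it (the subtree rule from the footnote on neighbors).

```python
from itertools import combinations


def selection(mean, stddev, informative=None):
    """Nested selection {hole: (module, nested selection)} for the "x" example."""
    nested = {"StddevInformative": (informative, {})} if informative else {}
    return {"Mean": (mean, {}), "Stddev": (stddev, nested)}


# The six concrete models, labeled "Mean module/Stddev module".
MODELS = {
    "standard/standard": selection("standard", "standard"),
    "standard/lognormal-yes": selection("standard", "lognormal", "yes"),
    "standard/lognormal-no": selection("standard", "lognormal", "no"),
    "normal/standard": selection("normal", "standard"),
    "normal/lognormal-yes": selection("normal", "lognormal", "yes"),
    "normal/lognormal-no": selection("normal", "lognormal", "no"),
}


def differing_holes(a, b):
    """List the holes on which two selections disagree, one entry per subtree."""
    holes = []
    for hole, (module_a, nested_a) in a.items():
        module_b, nested_b = b[hole]
        if module_a != module_b:
            holes.append(hole)  # a swapped module replaces its whole subtree
        else:
            holes.extend(differing_holes(nested_a, nested_b))
    return holes


# Two models are neighbors when exactly one hole (subtree) differs.
edges = {}
for (name_a, a), (name_b, b) in combinations(MODELS.items(), 2):
    holes = differing_holes(a, b)
    if len(holes) == 1:
        edges[frozenset((name_a, name_b))] = holes[0]

for pair, hole in edges.items():
    print(" -- ".join(sorted(pair)), "differs in", hole)
```

Under this rule every model in the example has three neighbors: one reached by swapping Mean and two reached by changing how Stddev is filled.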

‘Golf’ Example: A travelogue through model space

I’ve rewritten Andrew’s golf case study as a modular program to show what it would look like to write a case study as a “travelogue” of development through model space.

In the case study, the modeling task is to understand a golfer’s chance of sinking a shot given their distance from the hole. Like before, we’ll start by writing an outline of the problem that will be constant for the whole model space:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
model {
  y ~ NSuccesses(n, PSuccess(x));
}

We start with two holes: NSuccesses is the distribution of the number of successful shots given the number of attempts n and the probability of success; PSuccess is the probability of success given the distance from the hole x.

A good way to count successes is the Binomial distribution, so let’s add a module called “binomial” to fill NSuccesses:

module "binomial" NSuccesses(y | n, p) {
  y ~ binomial(n, p);
}

Now we need a module for PSuccess to map the distance of a shot x to the probability of success. We start with an inverse-logit (logistic) function:

module "logistic" PSuccess(x) {
  parameters {
    real a;
    real b;
  }
  return inv_logit(a + b*x);
}

We’ve now defined logistic regression, the first model from the case study:

data {
  int J;        // Number of distances
  vector[J] x;  // Distances
  int n[J];     // Number of shots at each distance
  int y[J];     // Number of successful shots at each distance
}
parameters {
  real a;
  real b;
}
model {
  y ~ binomial(n, inv_logit(a + b*x));
}
}
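Logistic regression turns the linear predictor a + b*x into a probability with the inverse-logit function, which is what PSuccess has to return. Here is a small Python sketch of that mapping; the values of `a` and `b` are made up for illustration and are not estimates from the golf data.

```python
import math


def inv_logit(u):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def logit(p):
    """Inverse of inv_logit: map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


# Made-up coefficients, chosen only so that success gets less likely with distance.
a, b = 2.0, -0.25

for x in [2, 8, 20]:
    p = inv_logit(a + b * x)
    print(f"distance {x}: probability of success {p:.3f}")
```

Passing a + b*x through logit instead would not give a valid probability, since logit is only defined for inputs strictly between 0 and 1.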

In this fashion, we can add a new module for each step of Andrew’s logic in the case study. See the full modular program here: http://ryanbe.me/modular-stan.html?example=golf.

Let’s use my prototype graphical interface to visualize the resulting modular program. Here are the parts of the interface:

  1. Modular program editor with compile and load buttons.
  2. Interactive module tree.
  3. Interactive network of models.
  4. Selected concrete Stan program, if any.
  5. Label and documentation editor for the selected concrete program, if any.

We can select modules to narrow down the network of models to a single concrete model, as shown in the figure at the top of this post.

And finally we can see the case study as a “travelogue” that documents a path through the network of models:

‘Birthday’ Example: Exploring model space with “scouting robots”

To demonstrate automation on the network of models, I’ve rewritten the Birthday case study as a modular program. I collaborated with Hyunji Moon to build this demo.

The goal of the birthday case study is to understand trends in the number of births over time. The authors use a Gaussian process time series model, and they experiment with including a few different types of trends (e.g. day-of-week, day-of-year, holidays, seasonal) and other modeling decisions (e.g. the weighting structure of days-of-week scores and the choice of hierarchical priors on day-of-year scores). When each trend inclusion or other modeling decision is represented as an independent choice of module, the modular program represents 120 models.

You can load the example with the full source code at http://ryanbe.me/modular-stan.html?example=birthday.

Here is the network of modules:

Suppose we want to find a model that does a reasonable job of fitting our time-series birthday data. That’s a lot of possibilities to explore manually! So let’s enlist a friendly robot to search the network for promising models.

We send off our robot with instructions for a greedy search:

“Start here.
Score the neighbors of every model you visit.
Move to the highest scoring model you’ve seen so far.
If you’re already at the highest scoring model, stop searching.”

This is like a discrete version of gradient ascent. We equip our robot with a scoring method: Expected Log-Predictive Density (ELPD) is a reasonable choice that approximates goodness-of-fit.
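The robot’s instructions translate fairly directly into code. Below is a minimal Python sketch, not the implementation behind the demo: `greedy_search`, the toy `GRAPH`, and the numbers in `TOY_SCORES` are all hypothetical. In practice `score` would fit a concrete Stan program and estimate its ELPD, which is by far the expensive step.

```python
def greedy_search(start, neighbors, score):
    """Greedy search over a network of models.

    neighbors(model) returns the adjacent models; score(model) returns a
    number where higher is better. Returns (final model, path, all scores).
    """
    scores = {start: score(start)}
    current = start
    path = [current]
    while True:
        # Score the neighbors of every model we visit (each model only once).
        for model in neighbors(current):
            if model not in scores:
                scores[model] = score(model)
        # Move to the highest scoring model seen so far.
        best = max(scores, key=scores.get)
        if scores[best] <= scores[current]:
            # Already at the highest scoring model: stop searching.
            return current, path, scores
        current = best
        path.append(current)


# Toy network and made-up scores standing in for ELPD estimates.
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
TOY_SCORES = {"A": -10.0, "B": -8.0, "C": -9.0, "D": -5.0, "E": -7.0}

goal, path, scores = greedy_search("A", GRAPH.__getitem__, TOY_SCORES.__getitem__)
print("path:", " -> ".join(path))
print("models scored:", len(scores))
```

Caching scores in a dictionary means no model is fit twice, and because the robot only ever moves to a strictly better model, the search has to stop on a finite network.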

Here is the data our robot collects:

The red annotations show the ELPD scores for the assessed models. The robot traveled along the arrow path from [START] to [GOAL].

In this case the robot ended its search at the case study’s final model. This optimization amounts to simple symbolic regression on probabilistic programs.

While greedy maximization of ELPD is naive and we shouldn’t blindly trust it to give us our final model, we can at least use it to find promising neighborhoods. I hope this serves as a proof of concept to open the door for other, smarter network-wide algorithms.

Future work

I hope that model space “maps” will give us new “navigation” tools for development, enable automatic search and validation, and help others to understand and reuse our work.

There’s a ton of opportunity for development in this area, such as:

  • What other algorithms can run on a network or neighborhood of models? What other scores would be useful to optimize?
  • How should we decorate network edges to support model navigation? What about for special cases like causal inference?
  • Can we validate modules themselves[6]?
  • Can we build libraries of modules?
  • Can we learn to synthesize modules?
  • Can we automatically ensemble or stack models in a network?
  • Can we include automated model transformations in the network in addition to module-swapping transformations?
  • If we’re fitting models in a network one after another, can we use their relatedness to sample more efficiently?
  • Are there extensions of the module-swapping system or other metaprogramming features that would be worth the added complexity[7]?

If you have any questions or ideas, please leave a comment or reach me directly at ryan[dot]bernstein[at]columbia.edu!

Footnotes:

[1] A more technical programming-languages term might be ‘module system with non-deterministic functor application’.

[2] We can treat the model space as discrete by unifying models that differ only by the values of continuous hyperparameters. Those hyperparameters can be optimized (or modeled) independently from the network of models approach.

[3] It is possible for modular programs to not form trees when two modules contain the same hole, but all modular programs form DAGs.

[4] Strictly speaking, to account for holes like StddevInformative that aren’t always in use, two models are neighbors if they differ by one subtree of modules.

[5] Akin to a Hamming graph.

[6] To paraphrase Bob’s 2021 ProbProg talk: modularity is the elephant in the room, but its big challenge is testing modules in isolation. The next best alternative may be to test within a modeling context by a sort of swap-one-out validation.

[7] For example: a special type of “hole” that’s filled with a collection of modules and returns their values as an array. This would be useful for cases like including a subset of features in a regression or components of a Gaussian process.

Event frequencies and my dated MLB analogy

Apparently, it’s blog day!

This post is by Lizzie, and I am requesting analogy help (by the way, thanks for your recent help on how to teach simulation to students).

Yesterday morning I watched a little Metro-Vancouver parks worker trundling along in their tractor, as they gathered up the debris strewn across the beach from our recent storm. The storm had been fantastic fun to cycle home during and I snapped some photos on my ride that do not at all do justice to how riled up the ocean looked (one shown). It also triggered the now-almost-normal stream of requests to link “the severe weather effects we are seeing in [insert place] and how this relates to global warming.”

Which led me to trot out my now very old analogy to explain why we cannot generally attribute any one specific weather event (a specific storm, frost, heat wave etc.) to climate change: consider an MLB player, let’s call her Barry…. For the beginning years of her MLB career she was a pretty good hitter and every so often hit a home run. In the later part of her career she started taking steroids and hit many more home runs on average. You can’t attribute any particular home run to Barry’s steroid use, but you can associate the changing frequency ….

I didn’t come up with this analogy. I copied it from someone who copied it from someone … and on and on until we find someone who thinks he invented it, but I bet he just forgot where he heard it.

And I like it! People generally get the connection and they are sometimes willing to let go of their urge to pressure me to say, ‘Whoa! What a storm that was yesterday. That storm was caused by climate change, folks.’ And the steroids fit nicely with our juiced-up climate system so it’s often a good segue into what’s changing in our climate system.

But my analogy feels really out of date! It feels old and I think I lose people who try to remember back when or figure out what I am talking about. I am wondering if anyone has (and is willing to share) a better one they’re using, or wants to propose one I can use.

Research on heat extremes is moving towards terms such as ‘nearly impossible in the absence of warming’ or ‘virtually impossible without human-caused climate change’, so maybe I can shelve my example someday? But I am not ready for that. (For anyone waiting on rapid attribution of the PNW storm, I suspect World Weather Attribution is working on it.)