Some Big Surprises About Small Experiments
For most of my early career at Mathematica, I worked on experimental evaluations in the field of education. Those studies typically focused on how factors like math or reading curricula affect test scores. In education research, we often randomly assign entire schools to a treatment or control group. Because test scores vary a fair bit across schools, these studies generally require a “large” number of schools (at least 30). This type of study, in which clusters of individuals are randomly assigned, is sometimes called a cluster randomized controlled trial (CRCT).
More recently, I’ve branched out into other fields in which variation in outcomes across clusters is much lower than in education, creating the potential to design CRCTs with a smaller number of clusters. For example, studies of teen pregnancy prevention programs also often randomize clusters, such as entire communities, but the cross-cluster variation in the outcomes of interest (such as knowledge of contraceptives or sexual initiation) can be much smaller. This led me to wonder: Can we design credible CRCTs with a much smaller number of clusters? And if so, how do we get the most out of the data—that is, how do we get the statistical power to detect small program effects?
Is it OK to randomize a handful of big clusters? Yes!
After spending a lot of quality time with my computer running simulations and corresponding with some very helpful referees from Evaluation Review, I now have some answers to these questions. (My paper, “Design and Analysis Considerations for Cluster Randomized Controlled Trials That Have a Small Number of Clusters,” is included in a special issue of the journal.) The surprising conclusions are that:
- Yes, we can design credible CRCTs with a small number of clusters. (I looked at sizes ranging from 4 to 10.)
- Yes, it is possible (though not always easy) to detect small program effects.
- Although we do have to take extra care in how we design the study and conduct analysis, we don’t need to rely on anything fancy—basic design and analysis tools work surprisingly well.
The reason these findings were surprising—at least to me—has to do with the way statistical methods are developed and understood. Most of what we know about the properties of statistical methods depends on the assumption that we have a large number of data records (at least 30). In studies with just 4 to 10 clusters, it was far from obvious that basic approaches like statistical testing could be counted on to work the way we expect them to. It turns out that they do, even without making additional assumptions about the data.
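To make this concrete, here is a minimal sketch of the kind of simulation check one could run. It is not the simulation code from the paper. It assumes normally distributed outcomes, equal-sized clusters split evenly between treatment and control, no true program effect, and a pooled-variance t-test on cluster means. The function names, the intraclass correlation of 0.01, and the cluster size of 1,000 are all illustrative assumptions. If the test behaves as expected, it should reject a true null hypothesis about 5 percent of the time.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom
# (standard t-table values; df = number of clusters - 2).
T_CRIT_05 = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306}


def cluster_mean_t_stat(treated, control):
    """Pooled-variance t statistic computed on cluster-level means."""
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * statistics.variance(treated)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    se = (pooled_var * (1 / n_t + 1 / n_c)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / se


def false_positive_rate(n_clusters=6, cluster_size=1000, icc=0.01,
                        reps=10000, seed=0):
    """Share of simulated null experiments in which the t-test rejects at 5%.

    n_clusters must be 4, 6, 8, or 10 (half treated, half control).
    icc is the share of total outcome variance that lies between clusters.
    """
    rng = random.Random(seed)
    half = n_clusters // 2
    crit = T_CRIT_05[n_clusters - 2]
    # Standard deviation of a cluster mean when total outcome variance is 1:
    # between-cluster variance plus averaged within-cluster variance.
    sd_mean = (icc + (1 - icc) / cluster_size) ** 0.5
    rejections = 0
    for _ in range(reps):
        # No true effect: every cluster mean comes from the same distribution.
        means = [rng.gauss(0, sd_mean) for _ in range(n_clusters)]
        if abs(cluster_mean_t_stat(means[:half], means[half:])) > crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for j in (4, 6, 8, 10):
        print(f"{j} clusters: false-positive rate = {false_positive_rate(j):.3f}")
```

Under these assumptions the rejection rate should sit close to the nominal 5 percent even with only 4 clusters. A fuller check would also try skewed or otherwise non-normal outcomes.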
My paper has quite a few recommendations for best design and analysis practices for researchers who want to craft CRCTs with a smaller number of clusters, all backed up by computer simulation work. For government agencies, foundations, and other funders of research, the key takeaways are:
- It is possible for researchers to conduct CRCTs with a small number of clusters. If a researcher tells you it is impossible because the statistics don’t work, or that they have to use exotic (which is to say, expensive) methodologies, you can point them to my paper.
- While it is possible to conduct these studies, there are some important requirements for them to work out well. For example:
  - While it is OK for the study to include a small number of clusters, they most likely need to be pretty big (with at least hundreds, but preferably thousands, of individuals in each cluster). For instance, a study could randomize entire Social Security administrative areas (which are often larger than states) to pilot a new approach to disseminating information about work incentives.
  - There can’t be a lot of variation across clusters in outcomes of interest: the vast majority of variation in the outcomes needs to be within clusters. For example, in a study that randomly assigns school districts to implement a program to reduce childhood obesity, there should be much more variation in students’ body mass index within school districts than there is in the average body mass index between districts.
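The two requirements above (big clusters, little variation between them) can be illustrated with a back-of-the-envelope calculation. The sketch below uses the textbook formula for the standard error of a treatment-control difference in means under cluster randomization. It assumes equal-sized clusters split evenly between the two groups, no covariates, and an outcome scaled to have a variance of 1. The function name and the example values of the intraclass correlation (ICC, the share of outcome variance that lies between clusters) are illustrative assumptions, not figures from the paper.

```python
import math


def impact_standard_error(n_clusters, cluster_size, icc):
    """Standard error, in outcome standard deviations, of the difference
    between treatment and control means in a cluster randomized design.

    Assumes equal allocation, equal cluster sizes, and no covariates.
    """
    # Variance of one cluster's mean: between-cluster variance plus
    # within-cluster variance averaged over the cluster's members.
    var_cluster_mean = icc + (1 - icc) / cluster_size
    # Each group has n_clusters / 2 clusters, so the variance of the
    # difference in group means is 2 * var_cluster_mean / (n_clusters / 2).
    return math.sqrt(4 * var_cluster_mean / n_clusters)


if __name__ == "__main__":
    for icc in (0.15, 0.01, 0.001):
        for size in (100, 1000, 10000):
            se = impact_standard_error(n_clusters=8, cluster_size=size, icc=icc)
            print(f"ICC={icc:<6} cluster size={size:<6} SE={se:.3f}")
```

In this formula, adding individuals to each cluster shrinks only the within-cluster term. The standard error can never fall below the square root of 4 × ICC ÷ (number of clusters), which is why a design with few clusters depends on the ICC being very small.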
It is understandable that many researchers believe that it is not possible to conduct credible CRCTs with a small number of clusters, but it turns out this is a valid approach that can provide useful information to decision makers in a wide variety of policy contexts.