This is a report on an empirical investigation into how many education projects funded by the National Science Foundation (NSF) include a randomized controlled trial (RCT) when possible. The answer: only about 24%, and that’s an extremely generous estimate.
Why should you care? Because about 80-90% of education interventions don’t work when tested rigorously. There is no reason for NSF to sponsor over 1,000 grants on education interventions and programs without putting a higher priority on rigorous methods, so that we know what works and what doesn’t.
Any new education or NSF-related legislation should address this deficit in our knowledge as to what works in education.
A Review of the Evidence on RCTs in Education
Sadly, the evidence to date suggests that most ideas in education don’t work, at least not better than the status quo.
In 2013, my former colleagues Jon Baron and others at the Coalition for Evidence-Based Policy reviewed all of the RCTs that had been commissioned as of that date by the Institute of Education Sciences (IES) at the US Department of Education.
Out of 90 educational interventions, only 11 were shown to have significant and positive effects. That’s only 12%. And when the Coalition limited its review to the 77 interventions studied in trials without substantial flaws (such as a lack of statistical power), only 7 interventions had significant and positive effects, or a mere NINE PERCENT.
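Just to make the arithmetic explicit, here is a trivial check using only the counts reported in the Coalition’s review:

```python
# Success rates from the Coalition's 2013 review of IES-commissioned RCTs
positive, total = 11, 90
print(f"All interventions: {positive}/{total} = {positive / total:.0%}")  # ~12%

positive_sound, total_sound = 7, 77
print(f"Trials without substantial flaws: {positive_sound}/{total_sound} = "
      f"{positive_sound / total_sound:.0%}")  # ~9%
```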
All the more reason that we should rigorously test education programs, rather than funding a bunch of programs where only 1 out of 10 will actually work.
What about since 2013?
Let’s take that in three steps.
First, replicating my colleagues’ original search, I went to the IES website and downloaded a list of all evaluation reports since 2013. I selected the ones that were RCTs, and the outcomes are listed below (some reports are excluded because they cover, for example, years one and two of a project whose final report came out after year three).
As you can see, out of those 16 RCTs, only 3 had positive results, 3 had mixed results, and the other 10 were null.
Second, I went to the What Works Clearinghouse website at the Department of Education, which reviews educational studies and rates the programs and interventions involved. Since 2013, it has reviewed 105 studies that mention being "randomized." The studies included everything from a remedial education program at the City University of New York, to a program for young children with emotional disorders, to a study of math video games.
Out of 105 studies, only 16 produced strong evidence that the program worked, while 6 produced “moderate” evidence. In other words, taking the most optimistic view here, only 22 out of 105 randomized studies (that is, 21%) since 2013 showed even modest evidence that an education intervention or program actually worked.
23 of the studies produced merely “promising” evidence. What does “promising” mean? An example is a randomized trial of a math instruction program for 3rd graders, where the kids who got the special program scored all of 3 points higher on a scale of over 200 points. Not very impressive.
The largest group (60 of the trials) showed “uncertain effects,” i.e., no evidence that the program worked.
Note that “21% of trials showed that a program worked” is probably too optimistic. For one, these 105 studies are likely not all of the RCTs conducted in the US since 2013, and there is likely some publication bias in what even gets written up and published in a form that the What Works Clearinghouse can review. Thus, the sample of RCTs reviewed here could be skewed, compared to what the Coalition for Evidence-Based Policy reviewed as of 2013. Even so, only around 21% of the trials had apparently moderate to solid evidence of success.
For a third perspective, let’s look at the 2009 American Recovery and Reinvestment Act (section 14007), which set up a large program called Investing in Innovation (known as i3).
The goal was to provide grants to local school districts and nonprofits to implement programs that are “demonstrated to have an impact on improving student achievement or student growth, as well as achievement gaps, dropout rates, college enrollment, etc.”
Tons of activity ensued, along with many evaluations. A report by Abt Associates reviewed the results of 148 evaluations; that report is available here. Of those, 37 evaluations were high-quality RCTs, while 75 involved “quasi-experimental designs or randomized trials that were not well executed” (p. 9).
26% of the evaluations showed significant and positive effects on student outcomes, although there was a lot of variation here: for “development” grants (i.e., developing new programs), only 15% showed a positive effect, while for scale-up grants (i.e., scaling up a program that already had rigorous evidence behind it), fully 56% showed positive effects.
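Pulling the headline figures from the three sources above into one place (the success rates are as reported; the “did not work” figures are simply the complements):

```python
# Headline success rates from the three evidence reviews discussed above
reviews = {
    "IES RCTs through 2013 (trials without substantial flaws)": 7 / 77,
    "What Works Clearinghouse since 2013 (strong or moderate evidence)": 22 / 105,
    "i3 evaluations overall (significant positive effects)": 0.26,
    "i3 scale-up grants (programs with prior rigorous evidence)": 0.56,
}

for source, worked in reviews.items():
    print(f"{source}: {worked:.0%} worked, roughly {1 - worked:.0%} did not")
```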
In short, we can be pretty confident of one thing: No matter where we look, 80-90+% of education programs won’t work (unless you focus only on the limited set of programs that already have strong evidence, and even then a program may only have about a 60% chance of working).
That makes it all the more important for education funders to demand RCTs in cases where an RCT is possible, rather than continuing to fund programs with no evidence that they work better than alternatives or the status quo, along with research projects that often don’t reliably answer the questions they themselves pose.
The National Science Foundation and STEM Education Programs
The National Science Foundation (NSF) has an entire directorate for STEM education. Its Division of Research on Learning in Formal and Informal Settings currently funds over 1,000 research projects on STEM education programs, training programs for teachers, etc.
Question: How many of these research grants even try to deploy an RCT in cases where an RCT would be the best choice?
Early in 2023, I went to the NSF website and downloaded a list of all the current NSF STEM education grants from that division (1,098 of them).
I then reviewed the abstracts for the first 600 of them, with an eye to the following:
Does this grant even try to answer a causal question? (I.e., does it ask whether a program “affects” or “impacts” or “leads to” student achievement?)
Many grants weren’t even trying to answer a causal question—they were just aiming to develop a new curriculum or other type of program, or to describe students’ reactions to a museum or an AI module, etc. That said, many of those grants are fairly pointless unless the grantees are prepared to launch an RCT in the future.
If the grant does try to measure causal impacts, and if it’s about the sort of program where an RCT would be possible, does the grant propose to use an RCT design or not?
The goal was to answer a very simple question: Out of all the STEM education grants that could and should have used an RCT, how many actually did so?
Bottom line: Not enough.
Only about 24% of these research projects (73 out of 301) used an RCT where it would have been possible to do so, and where an RCT would have been a better research design.
The Excel chart of my data is available here.
I tried to give these NSF grants every benefit of the doubt. If the study abstract so much as mentioned a word like “experiment” or “randomize,” I counted it as an RCT. For example, one grant said this about the study design: “Finally, researchers will determine the effects on students in an experimental pilot study.” That’s all that was said on that point, and it’s possible that the researchers were using the term “experimental” somewhat colloquially. Nonetheless, I counted it as an RCT.
In another case, the grant abstract said that it used “a quasi-experimental, experimental/comparison group research design,” which is contradictory, but I counted it as an experiment. Yet another grant claimed, “Using a quasi-experimental design, classes will be randomly assigned to the treatment or comparison conditions.” Apparently a number of education researchers don’t know that “quasi-experimental” is not the term to use to describe random assignment to treatment or control, but I counted this as an RCT.
In other words, my results are probably an overestimate that is biased in NSF’s direction. In all likelihood, the true number of RCTs (never mind RCTs that are rigorous and high-quality) is far lower than the estimate here.
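To make the coding rule concrete, here is a minimal sketch of the kind of generous screen I applied by hand. The keyword list and helper function below are purely illustrative (my actual review was done by reading abstracts, not by running a script):

```python
# A deliberately generous screen: any mention of an RCT-flavored term in an
# abstract was enough to count the grant as (at least claiming) an RCT.
RCT_HINTS = ("randomiz", "random assignment", "experiment")

def counted_as_rct(abstract: str) -> bool:
    """Return True if the abstract so much as hints at an experimental design."""
    text = abstract.lower()
    return any(hint in text for hint in RCT_HINTS)

# Example sentence quoted above; it was counted as an RCT despite saying little.
example = ("Finally, researchers will determine the effects on students "
           "in an experimental pilot study.")
print(counted_as_rct(example))  # True
```

Even under a screen this loose, only about a quarter of the grants that could have used an RCT were counted as using one.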
***
What do I mean by the above?
Here's one very typical example of an NSF-funded study that, in my judgment, could have used an RCT:
In this project, youth will apply scientific and computational science concepts and practices by learning to create and build automated systems where they can grow their own food. These specialized growing systems will utilize hydroponics where plants are grown without soil. Computational science is an interdisciplinary area that combines skills and knowledge from science, computation, computer science, engineering, and mathematics to solve problems. The prevalence of computational science in industry and everyday lives necessitates the development of a STEM workforce knowledgeable to solve complex real-world problems.
Researchers will train high school youth to program low-cost micro-controllers to collect data and use that data to automate hydroponic systems. To support the high school youth's interest in obtaining a post-secondary degree, they will be mentored by first-generation college students from alumni of Boston College's College Bound program and Lasell University's Pathways to Diversity.
This project will study ways to design an internal learning environment that promotes youth skills and interest in pursuing STEM careers. Utilizing a design-based research framework and drawing upon a mixed-method data collection approach including observations, surveys, interviews, and youth generated artifacts, the project team will explore how and in what ways the use of physical computing coupled with computational science improves youth interest and knowledge to pursue STEM careers. Survey data will be analyzed by first evaluating the measurement quality of the latent variables using confirmatory factor analysis and then using latent variable growth modeling to assess changes in repeated measures for four time points. The research and evaluation work will track 500 youth and their families over time and evaluate what aspects of the program are critical to youth success and STEM career self-efficacy and awareness. This work offers an approach to support youth in examining careers across fields of agriculture, computation, and science teaching.
So here we have a study that will look at 500 youth and their families, to see whether some sort of training in the electronic control of hydroponic systems “improves youth interest and knowledge to pursue STEM careers.” The researchers could have selected the 500 participants at random, or better yet, randomly assigned 250 of them to the program with the other 250 serving as a control group.
Instead of having any sort of comparison group (even one chosen with propensity score matching), they are just going to do observations and surveys over time.
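For contrast, here is a minimal sketch of what the random-assignment step could have looked like, assuming the team recruited a pool of 500 eligible youth (the roster, IDs, and seed below are invented purely for illustration):

```python
import random

# Hypothetical roster of 500 eligible youth (IDs are made up for illustration)
applicants = [f"youth_{i:03d}" for i in range(500)]

rng = random.Random(2023)  # fixed seed so the assignment is reproducible
rng.shuffle(applicants)
treatment, control = applicants[:250], applicants[250:]

# Both groups would then get the same surveys at the same time points; the
# difference in outcomes between them estimates the program's effect.
print(len(treatment), len(control))  # 250 250
```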
To me, this research project is a missed opportunity to get any sort of rigorous knowledge about whether kids are actually improving their success in STEM education and career readiness (leaving aside whether hydroponic training is actually the best idea to test in the first place).
There are hundreds of similar examples.
CONCLUSION
Which leads to an obvious question: why are we funding so many education research projects full of ill-defined jargon and built around programs/interventions that are marginal at best, while there remain so many unanswered questions about, say, how to use personalized tutoring to remedy the learning loss seen during Covid, or how to use spaced repetition in the classroom? And why aren’t we using more RCTs to figure out what actually works?
An education funding portfolio shouldn’t be quite so dominated by small-bore, marginal studies with no rigorous way to see what works.
We can do better for our nation’s K-12 students. I could easily imagine transferring a large portion (or even all) of the funding for NSF’s STEM education directorate to the Institute of Education Sciences (IES) at the Department of Education. IES, while not perfect, has funded many more high-quality studies over the past 20+ years. We should double down on high-quality research.
PS: A BROADER NOTE ON EDUCATION RESEARCH
Reviewing several hundred education studies was painfully boring, and this says much more about the education field than about NSF. For many months, I used this project as a cure for procrastination: Whenever I had another project where I was tempted to procrastinate, I’d open up the NSF document, try to review 10+ studies, and then I'd think, "nothing could be as tedious as this project, so I’ll do the other thing."
But why?
It’s not just the endless jargon. It’s the subject matter of so many studies. There are so many studies about make-work projects in the hopes of getting kids interested in math or science by some cheap trick that supposedly will make school more “relevant” to their lives.
There is little or no evidence that any of this works.
We end up with a bunch of programs and projects that are time-wasting in the case of the child who actually is curious about the universe and how to understand it, and time-wasting for everyone else because we never get to see good evidence on how to create more-motivated students.
Over 10 years ago, I maintained a then-anonymous blog called “Wasting Time in School.” It catalogued probably over a hundred examples of time-wasting school projects. A couple of examples were from my own experience (these incidents happened in the Katy ISD schools, widely reputed to be among the best schools in the Houston metro area). Here’s one:
My son is in what seems to be a well-regarded public middle school. In the past few weeks, here are the assignments I’ve seen him working on at home:
making a video about Edwin Hubble (it had very little information about Hubble in it, but the teacher said his was one of the best videos in the class);
making a fake Facebook page about Edwin Hubble (same); and
writing up a description of a fake dinosaur that he had imagined.
I’m not too worried about his science knowledge (when he was 9, he demanded that I subscribe to Scientific American for him to read), but I’m not confident that the school is doing as much as it could to instill knowledge in its students.
In other words, these were all projects designed to “get kids interested!” in science, but they had almost nothing to do with learning anything about science. Making fake Facebook pages isn’t what actual scientists do, and there’s no evidence that kids who dislike astronomy suddenly start caring when given a fake Facebook project. Nor is there evidence that the rare child who actually is fascinated by astronomy would be helped by the fake Facebook project, rather than deterred. In all possible ways, this project seemed like a waste of time.
In fact, the whole idea was depressing. Imagine teaching about Edwin Hubble and the amazing and awe-inducing discovery that the universe is expanding . . . and then thinking, “Meh, why would kids be interested in galaxies, stars, and the expansion of the entire freaking universe? Maybe I’ll get them to make a fake Facebook page, and then they’ll pay attention.”
Another project from my own experience:
The class: 8th grade English for gifted/talented kids. The assignment: memorize a rap about pronouns. Here’s the rap, if you can stand it:
This is a rap,
all about pronouns.
If you don’t learn it,
your grades are gonna go down.

Chorus:
Sit down learn it,
you don’t need a permit.
Memorize it, do it now:
Pronouns take the place of nouns.

The SUBJECT list—
It’s nothing new:
I, YOU, HE, SHE,
IT, WE, THEY, and WHO.

Chorus.
The OBJECT list—
It’s next we assume:
ME, YOU, HIM, HER,
IT, US, THEM, and WHOM.

Chorus.
POSSESSIVE pronouns—
The last of the list:
MY, MINE, YOUR, YOURS,
HIS, HERS, and ITS.
Wait! Still more to use:
OUR, OURS, THEIR, THEIRS,
Last but not least—WHOSE.

Chorus one last time.
My son is protesting to me that he has known about pronouns since, oh, the third grade or so. I can’t even imagine what the regular, non-gifted English class is doing. Kindergarten-level work, perhaps?
I was reminded of these experiences way too often when reviewing NSF-funded studies. Here are some examples:
A grant where kids will be asked to “use mobile technology to create digital mathematics stories. These stories will be made from videos, photos, and audio.”
A grant where kids will be asked to make art about “extreme weather,” which will then be displayed on public transit systems.
A grant where kids will be given an app that connects mathematical ideas to their local surroundings, such as “how a painting by a local Latino artist uses ratio and scale, or how a ramp in downtown was designed with a specific slope to accommodate wheelchairs.”
A grant to “investigate the feasibility of using remote cameras to survey local, urban wildlife to promote inclusive practices and youth engagement in STEM.”
A grant to “promote data literacy in high school students by engaging them in learning about the Quantified Self -- the practice of using technology to track and reflect on one’s own biological, behavioral, physical, and/or emotional data.”
A grant for which, “drawing on lesson study and Indigenous research design principles, researchers, teachers, and their students will collaborate to create animated concepts of mathematical ideas. Animated concepts include how students use mental images, material objects, and lived experiences that center Black, Native American, Latina, and newcomer knowledge bases related to mathematical concepts.”
A grant about “extreme heat/urban islands” for which “students will identify locally-relevant issues related to this phenomenon, conduct investigations to explore the issue, share their findings through arts-based community narratives, and advocate for change.”
A grant for a research project that “brings instruction in computing and data science to an elementary school with a predominantly Hispanic population through Project SMART, a digital educational game that uses students’ in-school physical activity to motivate student learning.”
A grant about “MothEd,” which “will focus on the study of moths, which are well-suited to the project’s goal of having students conduct authentic scientific investigations.”
A grant based on the idea that “authentic inquiries into science through embodied learning approaches can provide rich opportunities for sense-making through kinesthetic experience, embodied imagining, and the representation of physics concepts for Black and Latinx teens when learning approaches focused on dance and dance-making.”
A grant to promote scientific engagement as follows: “One strategy to promote STEM engagement and learning is to make clear and meaningful connections between STEM concepts, principles, and STEM-related issues relevant to the learner. Socioscientific issues (SSI) can provide a powerful avenue for promoting the desired kinds of engagement. SSI are debatable and ill-defined problems that have a basis in science but necessarily include moral and ethical choices. SSI for economically disadvantaged, culturally diverse students in urban settings might include, for example, lead paint contamination, poor water or air quality, or the existence of ‘food deserts.’”
A study on how people in Alameda, California, get into gardening.
It could be the case that dance, or moths, or hip-hop, or gardening, or wheelchair ramps, or urban wildlife, are the best possible ways to get kids interested in science. But it seems unlikely, especially if we’re not going to do any studies that generate rigorous evidence.
The premise of so many studies seems to be patronizing: "We can't expect underprivileged kids to care about anything other than hip-hop music and their immediate surroundings, so let's come up with some gimmick to get them interested in STEM for the first time."
Maybe that’s true, but we should at least do some RCTs on the idea.
In fact, let’s look at one actual NSF abstract in full (I could link to it, but I won’t do so—I don’t want to make individual researchers a target of criticism when the real issue is the broader NSF program and its philosophy):
This project is developing and researching tools to support teachers’ instructional shifts to achieve equitable sensemaking in middle school science classrooms. Sensemaking involves students building and using science ideas to address questions and problems they identify, rather than solely learning about the science others have done. Despite it being a central goal of recent national policy documents, such meaningful engagement with science knowledge building remains elusive in many classrooms. Students from non-dominant communities frequently do not see themselves as “science people” because their ways of knowing and experiences are often not valued in science classrooms. Professional learning grounded in teachers’ use of innovative high quality curriculum materials can help teachers learn to teach in new ways. Yet teachers need guidance to customize curriculum materials to fit their own local contexts and leverage students’ ideas and experiences while maintaining the goals of recent policy documents. This project is researching and developing customization tools to support teachers in their principled use and adaptation of materials for their classrooms. These customization tools will help teachers to better notice and leverage the ideas and experiences of non-dominant students to support all students in equitable sensemaking. During the project, 74 teachers from diverse schools will participate in professional learning using these customization tools. After testing, the customization tools and illustrative cases will be disseminated broadly to support teachers enacting any science curriculum in leveraging the ideas and experiences that students bring into the classroom. In addition, the research results in the form of design principles will inform future design of curriculum materials and professional learning resources for science.
A key element in science education reform efforts includes shifting the epistemic and power structures in the classroom so that teachers and students work together to build knowledge. Research shows that shifts in science teaching are challenging for teachers. Researchers and practitioners have collaborated to develop curriculum materials that begin to support teachers in this work. But teachers need to interpret these materials and customize the tasks and strategies for their own context as they work with their own students. Curriculum enactment is not prescriptive, but rather a “participatory relationship” between the teacher, curriculum materials, students and context, where teachers interpret the materials and the goals of the reform, and customize them to adapt the tasks and activity structures to meet the needs and leverage the resources of their students. The field needs to better understand how teachers learn from and navigate this participatory relationship and what supports can aid in this work. This project will include design-based research examining teachers’ customization processes and the development of tools to support teachers in adapting curriculum materials for their specific school context to facilitate equitable science sensemaking for all students, where all students engage in ambitious science knowledge building.
The major components of the research program will include: (1) Empirical study of teachers’ customization processes; (2) Theoretical model of teacher thinking and learning that underlies customization of curriculum materials; (3) Tools to support principled customization consistent with the goals of the reform; and (4) Empirical study of how tools influence teachers’ customization processes. The project is addressing the urgent need for scalable support for teacher learning for recent shifts in science education in relation to both a vision of figuring out and equity.
To be sure, I do agree that lots of kids fence themselves off from science and math unnecessarily, and it’s worth thinking about creative ways to get more students to be excited about those subjects and to see themselves as potential scientists.
But the language in this abstract seems off to me—what exactly does it mean to have a “non-dominant” way of “knowing” that isn’t currently valued in science classrooms? Are there ways of “knowing” that are better than actual science? Is there anything to this project besides a bunch of edu-jargon? This (along with a number of other NSF-funded grants) almost reminds me of the ill-fated Smithsonian webpage from 2020 claiming that examples of white supremacy include “emphasis on scientific method,” “objective, rational linear thinking,” “cause and effect relationships,” and “quantitative emphasis.”
To be clear, there were some outstanding NSF-funded studies. This one was probably my favorite:
This DRK-12 Impact study will investigate the efficacy of the Precision Mathematics First-Grade (PM-1) intervention through a methodologically rigorous randomized controlled trial. The study will utilize a randomized block design, blocking on classrooms and randomly assigning first-grade English learners (ELs) who face mathematics difficulties (MD) within first-grade classrooms to one of two conditions: (a) PM-1 intervention or (b) control (business-as-usual). Approximately 900 ELs from 150 first-grade classrooms will participate. Three research aims will guide this study. Aim 1 will systematically evaluate the average effect of PM-1 on student mathematics achievement; while Aim 2 will investigate differential response to the intervention based on student-level variables, including ELs proficiency in English and pretreatment mathematics performance. In Aim 3, researchers will explore whether the frequency and quality of mathematics discourse opportunities for ELs predicts gains in mathematics achievement. Although random assignment will take place at the student level, students will be assigned to small instructional group formats for intervention delivery. Therefore, the design employs a partially nested mixed-model Time × Condition analyses to evaluate the effect of PM-1 on pretest to posttest gains in mathematics achievement (Aim 1) and differential response to PM-1 based on student characteristics (Aim 2). A random coefficients analysis that nests repeated assessments within students and PM-1 intervention groups will explore whether the rate and quality of mathematics discourse opportunities predicts ELs’ gains in mathematics achievement (Aim 3).
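For readers who want to see the core design step spelled out, here is a minimal sketch of the within-classroom (blocked) random assignment the abstract describes. The class sizes, student IDs, and seed are invented for illustration, and the real study’s analysis (the partially nested mixed models) is of course far more involved:

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility (illustrative only)

# Hypothetical eligible students (English learners facing mathematics
# difficulties), grouped by the classroom that serves as their block.
# Six eligible students per classroom is an invented number; the abstract
# says only "approximately 900 ELs from 150 first-grade classrooms."
classrooms = {c: [f"class{c:03d}_student{s}" for s in range(6)] for c in range(150)}

assignment = {}
for classroom, students in classrooms.items():
    shuffled = students[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for student in shuffled[:half]:
        assignment[student] = "PM-1"      # intervention condition
    for student in shuffled[half:]:
        assignment[student] = "control"   # business-as-usual condition

print(sum(label == "PM-1" for label in assignment.values()),
      "of", len(assignment), "students assigned to PM-1")
```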
My only point in this whole report is that we need many more studies like that, and fewer small-scale, non-randomized studies on interventions that seem fairly marginal.
In an ideal world, I would have reviewed all 1,098 grants, but it was already unbearably tedious just to review 600 (around 75 of which later turned out to be duplicates; I wish I had looked for duplicates at the outset!). From a quick skim, the remaining grants did not look any different in subject matter or study design, and there’s no reason to think that NSF’s list was ordered based on RCT status. Thus, the first 600 entries are probably representative of the rest of the list.