Gold Standard? That Was Abandoned, You Know...

Randomised Controlled Trials (RCTs) have become increasingly popular in economics over the past decade or so, for example making up about 10% of papers in the highly-ranked American Economic Review. They represent an attempt to emulate an experimental environment by randomly assigning a ‘treatment’ to one group (the treatment group) and not to another group (the control group). The difference in economic outcomes between the two groups is then considered the effect of the treatment, which could be anything from an education subsidy to job training to a free condom program. RCTs are referred to as the "gold standard" of empirical econometric research by major economists such as Joshua Angrist and Jörn-Steffen Pischke (hereafter AP), with many other popular methods based on emulating this ideal.

But there is no reason to believe that RCTs as practised have fulfilled this ideal. There is also good reason to believe that they cannot do so even in principle. The idea, popularised by advocates such as AP, that RCTs can give us a pure causal effect of treatment, capturing ‘what works’, is all too often accepted uncritically. The inevitable practical shortcomings of RCTs are not formally incorporated into the analysis to see how they might affect the results. Finally, the ethical basis for RCTs is at best conditional on a number of questions which are not typically addressed in the literature. (AP's textbook contains the word 'ethic' once, when discussing an especially extreme example where police responses were determined partly by randomisation. It contains the word 'moral' twice, in both cases to convey a technical point rather than in a discussion of morality.)

RCTs have their roots in medicine, and are generally considered a necessity for testing the effectiveness of new drugs. Give some people the drug; don’t give it to others, then document the difference in health outcomes between them. Applied economists aspire to emulate the success of RCTs in medicine: for example, Esther Duflo, Rachel Glennerster, and Michael Kremer have stated that “creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th”. But economists seem to have imported RCTs from medicine with none of the habitual robustness checks and without truly considering whether they can be applied to economics in the same way they can be applied to medicine.

A major issue with RCTs is that when you are experimenting on people, the existence of the trial itself will affect their behaviour. In medicine, this usually occurs acutely in the form of the placebo effect: just taking a pill can make people better, even if the pill itself has no healing properties. Luckily, there is an obvious solution to this problem: give the control group an ‘empty’ pill or perhaps find another way to make them think they’ve been treated. Depending on the nature of the treatment, this cannot always be done, but it is certainly widespread practice, with ‘Usual Care’ sometimes replacing a pure placebo if the treatment is a new version of something patients need anyway, such as insulin shots. Unfortunately, this practice cannot usually be applied in economics, since it is not really possible to ‘hide’ which treatment someone is getting in a social experiment. People know whether or not they have been given a subsidy or a training program. As James Heckman et al. have documented, this can bias the results if those not given the treatment substitute for it with a similar treatment from another source, which is fairly common. It is also common for some who are supposed to be given the treatment not to take it, causing further bias.

Consider also the issue of 'follow-up', where researchers are unable to find out what happened to participants after the trial, either because they cannot be found or because they refuse to answer questions. In medicine, one common way to deal with this is to use 'best-worst case' sensitivity analysis. For example, if 4% of a sample are lost to follow-up in an RCT for a drug to prevent heart attacks, the study asks how the conclusions might change if all 4% of them had a heart attack, and also how the conclusions would change if none of this 4% had a heart attack. This is extreme, but it helps practitioners to consider the worst case scenario, and in medicine one cannot be too careful. From what I've seen, rates of loss to follow-up in economics are generally much higher than in medicine, with good reason to believe that those who are lost are not a random subset of participants. Yet economists do not practise the same kind of critical bounding of their results.
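To make the bounding exercise concrete, here is a minimal sketch in Python. The trial sizes and event counts are hypothetical, chosen only to match the 4% loss in the example above, and the function name is my own. It uses one common variant of the analysis, in which the missing outcomes are filled in first as favourably and then as unfavourably to the drug as possible.

```python
def best_worst_case_bounds(events_t, seen_t, lost_t, events_c, seen_c, lost_c):
    """Bound the risk difference (treatment minus control) under extreme
    assumptions about participants lost to follow-up."""
    n_t = seen_t + lost_t  # everyone randomised to treatment
    n_c = seen_c + lost_c  # everyone randomised to control
    # Best case for the drug: no lost treated participant had an event,
    # every lost control participant did.
    best = events_t / n_t - (events_c + lost_c) / n_c
    # Worst case for the drug: the reverse.
    worst = (events_t + lost_t) / n_t - events_c / n_c
    return best, worst


# Hypothetical trial: 1,000 people per arm, 4% (40 people) lost in each arm.
observed = 48 / 960 - 72 / 960  # risk difference among those followed up
best, worst = best_worst_case_bounds(48, 960, 40, 72, 960, 40)
print(f"observed {observed:+.3f}, best {best:+.3f}, worst {worst:+.3f}")
```

In this made-up trial the observed difference is 2.5 percentage points in favour of the drug, but the bounds run from 6.4 points in its favour to 1.6 points against it. Even a 4% loss is enough to leave the sign of the result open.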

Do RCTs Advance Understanding?

RCTs in economics are frequently held up as an example of moving away from the obsession with theory and allowing the empirical evidence to speak for itself. This may be a laudable aim in many ways. However, there’s good reason to believe that RCTs cannot substitute for cumulative progress in understanding informed by both theory and evidence. As Angus Deaton has pointed out, RCTs in economics are rarely used to test theories but instead form a disconnected patchwork, each investigating a particular project without necessarily giving a good idea of which mechanism is under investigation. The result is that it’s not always clear when or how the results can be applied to a different context. 

For example, RCTs only recover an average effect and there are common scenarios where this may not be useful. One such scenario is where there are "two internally homogeneous subgroups with distinct average treatment effects, so that the RCT delivers an estimate that applies to no one". Understanding when this will happen requires an understanding of mechanisms rather than just of 'what works'. Deaton argues that these kinds of complications typically require RCT practitioners to adopt other econometric methods to interpret their results, bringing with them all the assumptions that they hoped to avoid by using RCTs in the first place.
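A toy calculation illustrates the two-subgroup scenario. The subgroup shares and effects below are invented for illustration and are not taken from any study; the function name is likewise my own.

```python
def average_treatment_effect(subgroups):
    """Population-weighted average of subgroup treatment effects.

    `subgroups` is a list of (population_share, effect) pairs.
    """
    total_share = sum(share for share, _ in subgroups)
    return sum(share * effect for share, effect in subgroups) / total_share


# Two internally homogeneous subgroups (hypothetical numbers):
# half the population gains 2.0 units, the other half loses 1.0 unit.
subgroups = [(0.5, 2.0), (0.5, -1.0)]
ate = average_treatment_effect(subgroups)
print(f"average effect: {ate:+.2f}")
print("matches a subgroup effect:", any(effect == ate for _, effect in subgroups))
```

The average effect here is +0.5, yet nobody in the population experiences an effect of +0.5. An RCT that recovers this average perfectly would still say nothing about who gains, who loses, or why.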

However, while Deaton himself sees a positive role for economic theory in informing these problems, the following example illustrates that economic theory can also be used in the kind of ad hoc manner he warns against. Duflo, Dupas and Kremer (DDK) study an RCT in Kenya which tests the impact of education subsidies and sex education on a number of outcomes, in particular early fertility and STIs. 19,289 students at randomly selected schools were given one of the following:

1. Control group (nothing);

2. A subsidy of 2 school uniforms (in Kenya this is quite significant);

3. 3 teachers at their school with training in sex education;

4. Both the teachers and the subsidy.

The students were then followed up 3, 5 and 7 years after the trial. 

The results are curious. (2) reduced school dropout rates and early fertility, but not STI rates. (3) did not have much of an impact on anything. (4) reduced early fertility, but by less than (2); it also had no significant impact on dropout rates, but it reduced STI rates substantially. In other words: (3) had no effect while (2) affected some outcomes, but (4) – which is a combination of (2) and (3) – did not seem to produce results that were a simple combination of the results from (2) and (3).

Of course, strange results by themselves are not sufficient reason to dismiss an experiment; that would verge on an argument from personal incredulity. What makes science interesting is that different aspects of the world can interact in unpredictable ways. DDK opt for this type of explanation by developing an economic model which shows that when (2) and (3) are combined, plausible reactions from students can produce outcome (4). Perhaps. But there is another, fairly obvious possibility that they do not explore: that the results reflect problems with the study itself.

As mentioned above, it’s usually difficult to gather the ideal data from RCTs, and this case is no exception. Short- to medium-run data (after 3 and 5 years) were collected directly from schools but were imperfect due to absenteeism. Information on absent students was instead collected from teachers or their peers; the trial then followed up a random subsample of 1,420 at home and tested how consistent the data collected at home were with the data collected at school. DDK report about an 80% consistency rate for reports of pregnancy occurrence and casually note that this is “remarkably accurate” before moving on. But what if we assume that the data for all absent students were inaccurate 20% of the time, as implied by the results? How does this affect the main conclusions about early fertility? (I'd answer this myself, but the data on how many absent students there were are not available.) There’s no formal, critical evaluation of the impact of this disparity; just a short, unexplored judgement on the part of the authors that it’s not a problem. It’s simply not good enough to eyeball data like this when you’re trying to provide rigorous evidence for the impact of a policy.
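Even without the missing numbers, the mechanics of the problem can be sketched. The Python snippet below assumes, purely for illustration, that each yes/no pregnancy report is correct with probability 0.8 regardless of the true answer and of treatment arm. The pregnancy rates are hypothetical, and DDK's data may well not satisfy this symmetric-error assumption.

```python
def reported_rate(true_rate, accuracy):
    """Expected share of 'yes' reports when each binary report is correct
    with probability `accuracy` (assumed symmetric across answers and arms)."""
    return accuracy * true_rate + (1 - accuracy) * (1 - true_rate)


# Hypothetical true pregnancy rates among absent students in two arms.
true_control, true_treated = 0.20, 0.15
accuracy = 0.8  # the roughly 80% consistency rate, read here as report accuracy

true_gap = true_treated - true_control
reported_gap = (reported_rate(true_treated, accuracy)
                - reported_rate(true_control, accuracy))
print(f"true gap {true_gap:+.3f}, reported gap {reported_gap:+.3f}")
```

Under these assumptions the reported gap is the true gap multiplied by 2 × 0.8 - 1 = 0.6, so a true difference of 5 percentage points shows up as 3. If errors instead differed between arms, the distortion could run in either direction, which is exactly why the share of absent students and the pattern of misreporting need to be analysed rather than eyeballed.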

The data on long run outcomes are even more problematic. The trial initially managed to follow up 54% of the sample after 7 years, and interviewed almost all of these. Furthermore, the program then ‘intensively tracked’ another 29% of those not already sampled, again most of whom agreed to be interviewed - leading to a total follow-up rate of at most 65% of the original sample. Furthermore, only 25% of girls in the sample had an HIV test administered after 7 years, while only 58% had an HSV2 (a variety of herpes) test administered. I don’t think I need to perform the formal calculations (and again I can't from what's available in the paper) to note that if we apply the above ‘result bounding’ methodology from medicine and assume that even half of those for whom data were unavailable, say, contracted an STI (or didn’t), the results would change completely. This doesn't mean the results are certainly wrong, but it does mean that they are not particularly robust.

DDK make some effort to dismiss the possibility of non-random attrition but again this is not particularly rigorous and follows the logic of the drunk looking for his keys under the street lamp, relying on what's available rather than what's important. DDK's data allow them to compare short term outcomes between those who were tracked easily and those who had to be tracked intensively in the long term. They argue that if those who were harder to find were systematically different, they would also be different in the short term, and dismiss this possibility based on the data. But this raises the twin issues that (a) their analysis excludes those who were never tracked at all (~35% of the initial sample) and (b) they assume that short term outcomes can be extrapolated to long term outcomes. In fact, it is an open possibility – and is suggested by some of their own analysis – that over the long term outcomes began to change more quickly, and that changes in circumstances were likely to be a reason that participants couldn’t be followed up. All in all, when combined with the overall curious nature of the results and the fact that most tables are filled with statistically insignificant values, the suggestion that non-random attrition could be driving some of the results seems at least worth exploring.

To be clear, I’m not arguing that DDK’s interpretation of the results is wrong. It could be right, and their discussion is actually convincing in places. The problem is that they do not seriously consider the possibility that the data may be biased and how this would affect the results. Their explanation of the results is quite divorced from the purported rigour of the RCT itself and drives home the importance of an understanding beyond just the 'gold standard' of evidence provided by the experiment. Furthermore, the authors admit that they “do not present a formal test” of their model – which was seemingly constructed after the fact – but merely show that for some parameter values, it “can match all our empirical results”. But can we rule out the possibility that the model’s assumptions have just been cleverly designed to do this? Furthermore, is there any feasible set of results that DDK wouldn’t have been able to develop a model for, ex ante?

This ad hoc approach to both theory and practice – and, to be frank, the general opacity of the paper itself – left me with a feeling that I was being sold a dodgy car throughout. A more critical discussion of which theory was being tested and how problems with the trial should be handled – all submitted before the trial took place (as happens with FDA trials in medicine), or at least before the data were received by the practitioners – would help to address these doubts and to make the results more robust. I don’t think it’s unfair to say that these things are not common practice in RCTs in economics.

Playing God

A brief word on the most important issue underlying all of this: ethics. RCTs in medicine are used because the effect of a drug or treatment is never entirely predictable, and must satisfy “equipoise”, where doctors believe that the patient has an equal chance of getting better under both treatments. This is the only way to justify randomly withholding a treatment from some people. However, in economics, the ‘treatment’ in RCTs is often something that is unambiguously good (and certainly not bad), such as free condoms, a scholarship, or job training. Consequently, Tim Harford's question "how could we justify prescribing treatments without knowing whether or not they work?" does not really apply in economics. We know these things won't hurt people, and they may help. Therefore there is no reason not to give them to as many people as possible. Since RCTs are themselves quite costly, doing this more often could provide a good deal of assistance to the most vulnerable people in the world.

To be sure, it’s possible to imagine – given our current, unethical political system – a case where the only way to expand a desirable program is to use an RCT to convince decision makers at an international institution. But this is a far cry from the present habitual use of RCTs to address development questions without any discussion of ethics whatsoever. Furthermore, the practical issues with RCTs documented above give us good reason to expect that, on their own, they cannot provide the definitive evidence a policymaker needs, and therefore may not qualify to play this role.

Finally, none of this is to lionise RCTs in medicine, which have many ongoing issues, including ethical ones (as illustrated by Dallas Buyers Club). Nor is it to claim that no RCT in economics has been conducted properly or yielded useful results. The claim is simply that importing the ideal of what we might like RCTs to achieve, without formally addressing their limitations or placing them in their scientific and ethical context, means that they are much less likely to be useful or justified. I fully expect one response from economists to be “well, yes, we know there are issues with RCTs, thank you”. But this misses the point. I don’t deny that many practitioners will be aware of these issues – several of the papers I’ve cited came from mainstream journals, and any class or textbook on RCTs will contain some discussion of issues such as dropout, external validity and ethics.

Despite this, there is no systematic approach to handling these problems in the literature. There are no agreed standards for making your hypothesis clear before the trial is conducted. There are no agreements on ethical guidelines or limitations. And there is no agreed way to handle the fact that no RCT will conform perfectly to the ideal in practice and to account for this in the results. Once these issues are made explicit, it becomes clear that not only do RCTs in practice fail to deliver results that qualify the method as a ‘gold standard’ of research, but there is no reason to expect they can in theory. This is why claims that RCTs – as well as related methods – have ‘revolutionised’ empiricism in economics should be met with scepticism.

