Measuring Impact Through Field Experiments
It is important to remember that while all this studying is great, innovators should place a higher priority on helping people when some program or product is effective.
This post is from PopTech Editions III—"Made to Measure: The New Science of Impact," which explores the evolving techniques to accurately gauge the real impact of initiatives and programs designed to do social good. Visit PopTech for more interviews, essays, and videos with leading thinkers on this subject.
Which of these innovations do you think changed the world more fundamentally: televisions or refrigerators? The short answer is that we don’t know. The longer answer is that the question is flawed. What do you mean by changing the world? What do you mean by fundamental? For whom? How? Why?
The same is true for programs that aim to improve social or economic development. How do I know that a given program reduces poverty and improves human well-being? How do I know if this is the best investment I can make to improve the welfare of the people I serve?
There are two problems with answering these questions. First, people differ in ways that can be observed and in ways that cannot. Second, we do not live in a lab. An innovation is never the only change in our environment; it arrives alongside countless other changes occurring at the same time in our real-world lives. How then can I honestly attribute a change in people’s lives to just the innovation that I have introduced, and not to any of the other things that changed at the same time?
One approach that is gaining prominence in resolving this quandary is to conduct field experiments on a large enough scale to rigorously tease out the effect of an innovation on the average person, on people with different characteristics, in the real world, and in a manner in which we can clearly attribute the observed change to the innovation we have introduced. The approach is very similar to that used in evaluating the effectiveness and drawbacks of new drugs through medical trials. This technique can provide organizations designing and promoting these innovations, as well as the governments supporting and regulating them, with a great deal of insight on how to allocate scarce resources.
Let's use the example of the television and the refrigerator. You would start by choosing a population from which you pull a random sample of people, so their characteristics approximate the distribution of attributes in the population. You would collect a set of information from all of these sampled households on the way they live, their consumption patterns, their health, their learning, their well-being.
You would then hold a lottery in which you put all the sampled people’s names into a large bucket and you close your eyes and pull out a third who will get televisions, another third who will get refrigerators, and the remaining third who will get neither.
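The lottery can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the household identifiers, the `assign_arms` helper and the fixed seed are all hypothetical, and a real study would typically add safeguards such as stratifying the draw on baseline characteristics.

```python
import random

ARMS = ("television", "refrigerator", "neither")

def assign_arms(household_ids, arms=ARMS, seed=2011):
    """Randomly split households into arms whose sizes differ by at most one."""
    rng = random.Random(seed)   # fixed seed (an assumption) so the draw can be reproduced and audited
    shuffled = list(household_ids)
    rng.shuffle(shuffled)       # the "eyes closed" draw from the bucket
    # Deal the shuffled names out in turn, one arm after another.
    return {hh: arms[i % len(arms)] for i, hh in enumerate(shuffled)}

# Hypothetical sample of 300 households.
households = [f"hh_{n:03d}" for n in range(300)]
assignment = assign_arms(households)

counts = {arm: sum(1 for a in assignment.values() if a == arm) for arm in ARMS}
print(counts)
```

With 300 sampled households, each arm ends up with exactly 100. Because chance alone decides who lands where, no characteristic of a household, observed or not, can influence which group it joins.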
The three groups are statistically equivalent, mirrors of one another. And so we expect them to evolve in equivalent fashions, facing all the shocks in the world in a similarly diverse manner, except for the fact that some of them by chance have a television, and others a refrigerator, and others neither.
We then return to these families a year or two later and ask them the same or a very similar set of questions on the way they live, their consumption patterns, their health, their learning, their well-being. We then compare how much consumption, health, learning and well-being changed among those with televisions, those with refrigerators, and those with neither.
This research question makes for a valuable experiment because we do not know the answer with any reliability beforehand. Our intuition and the existing evidence may fail us, because each of these innovations can affect numerous behaviors and activities in the household, can affect different types of households differently, and can have both positive and negative outcomes.
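One common way to make this comparison concrete is a difference-in-differences calculation: the average change in a group that received an item, minus the average change in the group that received neither. The sketch below assumes that approach. The record layout, field names and numbers are invented purely for illustration and are not results from any study.

```python
from statistics import mean

def mean_change(records, arm):
    """Average follow-up-minus-baseline change in the outcome for one arm."""
    return mean(r["followup"] - r["baseline"] for r in records if r["arm"] == arm)

def diff_in_diff(records, treated_arm, comparison_arm="neither"):
    """Change in the treated arm minus change in the comparison arm."""
    return mean_change(records, treated_arm) - mean_change(records, comparison_arm)

# Made-up monthly consumption figures for six hypothetical households.
records = [
    {"arm": "television",   "baseline": 100, "followup": 112},
    {"arm": "television",   "baseline": 90,  "followup": 98},
    {"arm": "refrigerator", "baseline": 95,  "followup": 115},
    {"arm": "refrigerator", "baseline": 105, "followup": 121},
    {"arm": "neither",      "baseline": 100, "followup": 106},
    {"arm": "neither",      "baseline": 92,  "followup": 96},
]

print(diff_in_diff(records, "television"))    # 10 - 5 = 5
print(diff_in_diff(records, "refrigerator"))  # 18 - 5 = 13
```

Subtracting the change in the comparison group removes whatever happened to everyone over the period, such as a good harvest or rising prices, and leaves the part of the change associated with the item itself. A real analysis would also report how uncertain each estimate is.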
For instance, it may be that perceived well-being is higher among the television households than in the other two groups. At the same time, the actual consumption levels (that correspond with income earned) among the refrigerator households might be higher because now the woman of the house can spend time on paid part-time work in the time freed up from having to cook multiple fresh meals a day.
Yet, the health indicators of those in the comparison group that got neither might be higher because they eat more fresh food and spend more time on exercise, since they don’t have a television or a refrigerator. Further, we could find that even within the households, the women in the refrigerator household now report greater confidence and control over decision-making than the women of the households in the other groups. Whatever the findings, the initial randomization of who gets each innovation gives us confidence that the differences we see can be attributed to the change we introduced: having a television or refrigerator.
The same process applies to evaluating poverty alleviation programs. These trials allow innovators to determine what best improves specific welfare outcomes at the individual and household level.
In a simple innovation tested in Kenya [1], ATM debit cards were provided for free (1) to just the male head of household in one group, (2) to just the female head of household in another, and (3) jointly to both the male and female heads of household in a third group.
As we might have thought, providing people easier access to a safe savings account did have a significant positive impact on increasing savings activity and balances—but only when it was a joint account or an account held by a male head of household. It did not do anything for women heads of households. This does not fit in easily with the story that reducing the cost of banking will lead to higher savings and financial inclusion across the board. It compels us to explore why there is such a different impact, and find ways for financial services to be designed in a way that addresses this constraint faced by those women.
While our refrigerator and television exercise might sound simple at first, designing and executing such experiments in the real world, as in the debit card study, is highly complex. Is your sample large enough to tell, with a high degree of statistical certainty, whether the differences in outcomes measured between the groups are due to the innovation rather than to random variability? How would you plan for spillovers, for example if some of your refrigerator households bought televisions and vice versa? How would you account for attrition if, say, a quarter of the people in your original comparison group moved to a neighboring state that was offering free refrigerators (a shock that did not affect all groups proportionally)?
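The sample-size question can be roughed out before any household is visited. The sketch below uses the standard normal-approximation formula for comparing the means of two equally sized groups, then inflates the result for expected attrition. The function name, the default 5 percent significance level and the 80 percent power are conventional choices assumed here, not figures from the text, and the formula ignores complications such as clustering or comparing more than two arms.

```python
from math import ceil
from statistics import NormalDist

def households_per_arm(effect_sd, alpha=0.05, power=0.80, attrition=0.0):
    """Approximate households needed in each of two arms.

    effect_sd: smallest difference worth detecting, in standard deviations of the outcome.
    attrition: share of households expected to be lost before the follow-up survey.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # protection against missing a real effect
    n = 2 * (z_alpha + z_power) ** 2 / effect_sd ** 2
    return ceil(n / (1 - attrition))               # enroll extra households to cover dropouts

print(households_per_arm(0.2))                  # no attrition
print(households_per_arm(0.2, attrition=0.25))  # a quarter of households lost
```

Under these assumptions, detecting a difference of one fifth of a standard deviation takes roughly 390 households per group, and losing a quarter of them pushes the requirement above 500. Inflating the sample only restores statistical power. It does not correct the bias that arises when, as in the free-refrigerator example, the households that leave differ systematically from those that stay.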
All these considerations need to be taken into account in the design of the experiment to make for a robust and rigorous study. Further, the innovations themselves need to be at a stage where they can be tested at scale to measure impact, without fear that they start breaking down or are themselves unstable or unreliable in doing what they are supposed to do.
Another necessity: Innovators must take great care to protect the participants in any study. There are strict ethical, moral and prudential norms that must be followed to minimize risk in conducting human subjects research in the social sciences. This includes the Institutional Review Board approval process that is required for all field experiments involving human subjects research.
It is also important to remember that while all this studying is great, innovators should place a higher priority on having the results of such studies translate into timely policies that help people when some program or product is found to be highly effective. Many studies, for example, are designed as phase-in studies, where at regular intervals an additional set of participants from the control group is given the treatment, so that the control group shrinks and ultimately disappears once everyone in both groups has the intervention. The timeline of this phase-in can be determined based on how strong the positive results turn out to be.
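A phase-in schedule of this kind can be sketched as follows. The helper name, the household identifiers and the choice to decide the order of waves by lottery are assumptions made for illustration. In practice the spacing of waves would depend on how strong the early results are.

```python
import random

def phase_in_schedule(control_ids, waves, seed=2011):
    """Randomly order the control group and split it into `waves` batches."""
    if waves < 1:
        raise ValueError("waves must be at least 1")
    ids = list(control_ids)
    random.Random(seed).shuffle(ids)  # assumption: phase-in order is itself set by lottery
    return [ids[i::waves] for i in range(waves)]

# Hypothetical control group of 90 households, phased in over three waves.
control_group = [f"hh_{n:03d}" for n in range(90)]
schedule = phase_in_schedule(control_group, waves=3)

remaining = len(control_group)
for wave_number, batch in enumerate(schedule, start=1):
    remaining -= len(batch)
    print(f"wave {wave_number}: {len(batch)} households treated, {remaining} still in control")
```

Each wave moves 30 households out of the control group, which shrinks from 90 to 60 to 30 and then to zero. Until its turn comes, every untreated batch still serves as a valid comparison for the batches already treated.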
Another way to move ahead with spreading effective innovations even as their impacts are rigorously tested is for the intervention, if very positive, to be replicated or scaled in other locations where the experiment is not running. This can take place even as the trial, with its treatment and control groups, continues for the duration of the experiment, so the measurement of impact is robust and clear to enable a strong basis for greater scale-up over the longer term.
All this may sound like a lot of fun for academics, and perhaps unnecessary and expensive, but it is useful to remind ourselves of the humble premise on which we started. There is a great deal that we do not understand in the world and a great deal that theory alone, or observation of a few cases, cannot answer. The effort at impact evaluation allows us to rigorously measure how people really benefit from innovations, and that is a necessary first step to fundamentally changing the world.
[1] Schaner, Simone. “The Cost of Convenience? Transaction Costs, Bargaining Power, and Savings Account Use in Kenya.” Working Paper, Dartmouth, 2011. <http://www.dartmouth.edu/~sschaner/main_files/ATM_PaperDraftTABLES_07Jul2011.pdf>