Difference-in-differences with variation in treatment timing

doi:10.1016/j.jeconom.2021.03.014

Journal of Econometrics

Volume 225, Issue 2, December 2021, Pages 254-277

https://doi.org/10.1016/j.jeconom.2021.03.014

Abstract

The canonical difference-in-differences (DD) estimator contains two time periods, “pre” and “post”, and two groups, “treatment” and “control”. Most DD applications, however, exploit variation across groups of units that receive treatment at different times. This paper shows that the two-way fixed effects estimator equals a weighted average of all possible two-group/two-period DD estimators in the data. A causal interpretation of two-way fixed effects DD estimates requires both a parallel trends assumption and treatment effects that are constant over time. I show how to decompose the difference between two specifications, and provide a new analysis of models that include time-varying controls.

Introduction

Difference-in-differences (DD) is both the most common and the oldest quasi-experimental research design, dating back to Snow’s (1855) analysis of a London cholera outbreak.¹ A DD estimate is the difference between the change in outcomes before and after a treatment (difference one) in a treatment versus control group (difference two): $(\bar{y}_{TREAT}^{POST} - \bar{y}_{TREAT}^{PRE}) - (\bar{y}_{CONTROL}^{POST} - \bar{y}_{CONTROL}^{PRE})$. That simple quantity also equals the estimated coefficient on the interaction of a treatment group dummy and a post-treatment period dummy in the following regression: $y_{it} = γ + γ_{i\cdot} TREAT_{i} + γ_{\cdot t} POST_{t} + β^{2x2} TREAT_{i} \times POST_{t} + u_{it}$ (1). The elegance of DD makes it clear which comparisons generate the estimate, what leads to bias, and how to test the design. The expression in terms of sample means connects the regression to potential outcomes and shows that, under a common trends assumption, a two-group/two-period (2x2) DD identifies the average treatment effect on the treated. Almost all econometrics textbooks and survey articles describe this structure,² and recent methodological extensions build on it.³
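The equivalence between the difference of sample means and the interaction coefficient can be checked numerically. The sketch below is only an illustration: the eight observations are made up, and the small `ols` helper is a hypothetical stand-in for any regression routine.

```python
# Hypothetical four-cell data set: (treat, post, outcome) for each observation.
rows = [
    (0, 0, 10.0), (0, 0, 12.0),   # control, pre
    (0, 1, 13.0), (0, 1, 15.0),   # control, post
    (1, 0, 20.0), (1, 0, 22.0),   # treatment, pre
    (1, 1, 28.0), (1, 1, 30.0),   # treatment, post
]

def cell_mean(treat, post):
    """Mean outcome in one group-by-period cell."""
    vals = [y for g, p, y in rows if g == treat and p == post]
    return sum(vals) / len(vals)

# Difference one (post minus pre) within each group, then difference two across groups.
dd_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Regressors: constant, TREAT, POST, TREAT x POST.
X = [[1.0, g, p, g * p] for g, p, _ in rows]
y = [v for _, _, v in rows]
beta = ols(X, y)

print(dd_means, beta[3])  # the two numbers agree up to floating-point error
```

Because the regression is saturated in the two dummies, its fitted values are the four cell means, which is why the interaction coefficient reproduces the double difference.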

Most DD applications diverge from this 2x2 setup, though, because treatments usually occur at different times.⁴ Local governments change policy. Jurisdictions hand down legal rulings. Natural disasters strike across seasons. Firms lay off workers. In this case researchers estimate a regression with dummies for cross-sectional units ($α_{i\cdot}$) and time periods ($α_{\cdot t}$), and a treatment dummy ($D_{it}$): $y_{it} = α_{i\cdot} + α_{\cdot t} + β^{DD} D_{it} + e_{it}$ (2). In contrast to our substantial understanding of canonical 2x2 DD, we know relatively little about the two-way fixed effects DD when treatment timing varies. We do not know precisely how it compares mean outcomes across groups.⁵ We typically rely on general descriptions of the identifying assumption like “interventions must be as good as random, conditional on time and group fixed effects” (Bertrand et al., 2004, p. 250). We have limited understanding of the treatment effect parameter that regression DD identifies. Finally, we often cannot evaluate how and why alternative specifications change estimates.⁶
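As a minimal sketch of the two-way fixed effects regression with staggered treatment dates, the code below estimates the coefficient on $D_{it}$ by double demeaning, which in a balanced panel is equivalent to including unit and period dummies. The panel, the unit effects, and the constant effect of 2 are all hypothetical.

```python
def twfe(panel):
    """panel maps (unit, period) -> (D, y) for a balanced panel.
    Returns the coefficient on D from a regression with unit and period
    dummies, computed via the within (double-demeaning) transformation."""
    units = sorted({i for i, _ in panel})
    periods = sorted({t for _, t in panel})

    def demean(idx):
        v = {k: panel[k][idx] for k in panel}
        unit_mean = {i: sum(v[i, t] for t in periods) / len(periods) for i in units}
        time_mean = {t: sum(v[i, t] for i in units) / len(units) for t in periods}
        grand = sum(v.values()) / len(v)
        return {(i, t): v[i, t] - unit_mean[i] - time_mean[t] + grand for i, t in v}

    d, y = demean(0), demean(1)
    return sum(d[k] * y[k] for k in panel) / sum(d[k] ** 2 for k in panel)

# Hypothetical panel: two units treated in period 3, two in period 6, one never treated.
treat_date = {"a": 3, "b": 3, "c": 6, "d": 6, "e": None}
unit_fe = {"a": 1.0, "b": 4.0, "c": -2.0, "d": 0.5, "e": 3.0}

panel = {}
for i, start in treat_date.items():
    for t in range(1, 9):
        D = int(start is not None and t >= start)
        # Unit effect + common time trend + a constant treatment effect of 2.
        panel[i, t] = (D, unit_fe[i] + 0.3 * t + 2.0 * D)

beta_dd = twfe(panel)
print(beta_dd)  # recovers the constant effect up to floating-point error
```

With a constant effect and exactly parallel untreated trends the estimator returns the true effect; the interesting cases are the ones where either condition fails.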

This paper shows that the two-way fixed effects DD estimator in (2) (TWFEDD) is a weighted average of all possible 2x2 DD estimators that compare timing groups to each other (the DD decomposition). Some use units treated at a particular time as the treatment group and untreated units as the control group. Some compare units treated at two different times, using the later-treated group as a control before its treatment begins and then the earlier-treated group as a control after its treatment begins. The weights on the 2x2 DDs are proportional to timing group sizes and the variance of the treatment dummy in each pair, which is highest for units treated in the middle of the panel.
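The decomposition can be verified on a small balanced panel. The sketch below assumes a simplified form of the weights: each weight is proportional to the product of the two group shares and the variance of the treatment dummy within the pair's comparison window, then normalized to sum to one. The group sizes, treatment dates, and outcomes are arbitrary, and all names are hypothetical.

```python
T = 6
# Hypothetical timing groups: name -> (number of units, first treated period or None).
groups = {"early": (2, 2), "late": (3, 4), "never": (2, None)}
N = sum(n for n, _ in groups.values())

panel, group_of, uid = {}, {}, 0            # panel[(unit, t)] = (D, y)
for g, (n, first) in groups.items():
    for _ in range(n):
        group_of[uid] = g
        for t in range(1, T + 1):
            D = int(first is not None and t >= first)
            y = (7 * uid + 3 * t * t + uid * t) % 11 + 1.5 * D   # arbitrary outcomes
            panel[uid, t] = (D, float(y))
        uid += 1

def twfe(panel):
    """Coefficient on D with unit and period effects, via double demeaning."""
    units = sorted({i for i, _ in panel})
    periods = sorted({t for _, t in panel})
    def demean(idx):
        v = {k: panel[k][idx] for k in panel}
        um = {i: sum(v[i, t] for t in periods) / len(periods) for i in units}
        tm = {t: sum(v[i, t] for i in units) / len(units) for t in periods}
        grand = sum(v.values()) / len(v)
        return {(i, t): v[i, t] - um[i] - tm[t] + grand for i, t in v}
    d, y = demean(0), demean(1)
    return sum(d[k] * y[k] for k in panel) / sum(d[k] ** 2 for k in panel)

def gmean(g, ts):
    """Mean outcome of group g over the periods in ts."""
    vals = [panel[i, t][1] for i in group_of if group_of[i] == g for t in ts]
    return sum(vals) / len(vals)

def dd(treat, ctrl, pre, post):
    return (gmean(treat, post) - gmean(treat, pre)) - (gmean(ctrl, post) - gmean(ctrl, pre))

share = {g: n / N for g, (n, _) in groups.items()}
first_period = {g: f for g, (_, f) in groups.items() if f is not None}
dbar = {g: (T - f + 1) / T for g, f in first_period.items()}   # share of periods treated
all_t = range(1, T + 1)

terms = []   # (label, 2x2 DD, unnormalized weight)
timing = sorted(first_period, key=first_period.get)
for k in timing:   # each timing group versus the never-treated group
    pre = [t for t in all_t if t < first_period[k]]
    post = [t for t in all_t if t >= first_period[k]]
    terms.append((f"{k} vs never", dd(k, "never", pre, post),
                  share[k] * share["never"] * dbar[k] * (1 - dbar[k])))
for a, k in enumerate(timing):   # earlier group k versus later group l
    for l in timing[a + 1:]:
        pre = [t for t in all_t if t < first_period[k]]
        mid = [t for t in all_t if first_period[k] <= t < first_period[l]]
        post = [t for t in all_t if t >= first_period[l]]
        both = share[k] * share[l] * (dbar[k] - dbar[l])
        terms.append((f"{k} vs {l} (not yet treated)", dd(k, l, pre, mid), both * (1 - dbar[k])))
        terms.append((f"{l} vs {k} (already treated)", dd(l, k, mid, post), both * dbar[l]))

total = sum(w for _, _, w in terms)
weights = [w / total for _, _, w in terms]
beta_decomp = sum(s * b for s, (_, b, _) in zip(weights, terms))
beta_twfe = twfe(panel)

for s, (label, b, _) in zip(weights, terms):
    print(f"{label:32s} DD = {b:7.3f}  weight = {s:.3f}")
print(beta_decomp, beta_twfe)   # equal up to floating-point error
```

Printing the terms next to their weights is a text version of the plot of 2x2 DDs against weights; summing the weights on the two timing rows gives the share of the estimate that comes from timing comparisons.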

I first use this DD decomposition to show that TWFEDD estimates a variance-weighted average of treatment effect parameters, sometimes with “negative weights” (Borusyak and Jaravel, 2017; de Chaisemartin and D’Haultfœuille, 2020; Sun and Abraham, 2020).⁷ When treatment effects do not change over time, TWFEDD yields a variance-weighted average of cross-group treatment effects and all weights are positive. Negative weights only arise when average treatment effects vary over time. The DD decomposition shows why: when already-treated units act as controls, changes in their outcomes are subtracted and these changes may include time-varying treatment effects. This does not imply a failure of the design in the sense of non-parallel trends in counterfactual outcomes, but it does suggest caution when using TWFE estimators to summarize treatment effects.
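A stylized example makes the mechanism concrete. In the sketch below, untreated outcomes are zero for every unit, so parallel trends holds exactly, and the treatment effect grows by one unit per treated period. The two groups, their treatment dates, and the effect path are hypothetical.

```python
T = 6
first = {"early": 3, "late": 5}   # hypothetical treatment dates, one unit per group

def effect(g, t):
    """Treatment effect that grows by one unit per treated period."""
    return float(t - first[g] + 1) if t >= first[g] else 0.0

# Untreated potential outcomes are zero, so observed outcomes equal the effects.
y = {(g, t): effect(g, t) for g in first for t in range(1, T + 1)}
D = {(g, t): float(t >= first[g]) for g in first for t in range(1, T + 1)}

def mean(g, ts):
    return sum(y[g, t] for t in ts) / len(ts)

pre, mid, post = [1, 2], [3, 4], [5, 6]
# Early group treated, late group not yet treated: a clean comparison.
dd_early_vs_late = (mean("early", mid) - mean("early", pre)) - (mean("late", mid) - mean("late", pre))
# Late group treated, early group already treated: the control's growing effect is subtracted.
dd_late_vs_early = (mean("late", post) - mean("late", mid)) - (mean("early", post) - mean("early", mid))

def demean(v):
    """Remove unit means and period means from a balanced (group, period) panel."""
    gs, ts = sorted({g for g, _ in v}), sorted({t for _, t in v})
    gm = {g: sum(v[g, t] for t in ts) / len(ts) for g in gs}
    tm = {t: sum(v[g, t] for g in gs) / len(gs) for t in ts}
    grand = sum(v.values()) / len(v)
    return {(g, t): v[g, t] - gm[g] - tm[t] + grand for g, t in v}

Dt, yt = demean(D), demean(y)
beta_twfe = sum(Dt[k] * yt[k] for k in Dt) / sum(Dt[k] ** 2 for k in Dt)

# Average effect over all treated unit-periods.
true_att = sum(y[k] for k in y if D[k]) / sum(D.values())

print(dd_early_vs_late, dd_late_vs_early, beta_twfe, true_att)
```

The comparison that uses the already-treated group as a control comes out negative even though every treatment effect is positive, and the two-way fixed effects estimate lands well below the average effect among treated observations.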

Next I use the DD decomposition to define “common trends” when one is interested in using TWFEDD to identify the variance-weighted treatment effect parameter. Each 2x2 DD relies on pairwise common trends in untreated potential outcomes, so the overall assumption is an average of these terms using the variance-based decomposition weights. The extent to which a given timing group’s differential trend biases the overall estimate equals the difference between the total weight on 2x2 DDs where it is the treatment group and the total weight on 2x2 DDs where it is the control group. Because units treated near the beginning or the end of the panel have the lowest treatment variance, they can get more weight as controls than as treatments. In designs without untreated units they always do.
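The net weight on each timing group can be tabulated directly. The sketch below assumes a design with three equally sized timing groups and no untreated units, and it uses the same simplified weight form as an assumption: group shares times the treatment variance within each pair's comparison window. Dates and names are hypothetical.

```python
T = 10
first = {"early": 2, "middle": 6, "late": 10}      # hypothetical treatment dates
share = {g: 1 / 3 for g in first}                    # equal group sizes
dbar = {g: (T - s + 1) / T for g, s in first.items()}  # share of periods treated

as_treat = {g: 0.0 for g in first}   # total weight where the group is the treatment group
as_ctrl = {g: 0.0 for g in first}    # total weight where the group is the control group

order = sorted(first, key=first.get)
for a, k in enumerate(order):        # k is treated before l
    for l in order[a + 1:]:
        both = share[k] * share[l] * (dbar[k] - dbar[l])
        w_k = both * (1 - dbar[k])   # k treated, l not yet treated (control)
        w_l = both * dbar[l]         # l treated, k already treated (control)
        as_treat[k] += w_k
        as_ctrl[l] += w_k
        as_treat[l] += w_l
        as_ctrl[k] += w_l

total = sum(as_treat.values())       # normalizes the decomposition weights to sum to one
net = {g: (as_treat[g] - as_ctrl[g]) / total for g in first}

for g in order:
    print(f"{g:7s} treatment {as_treat[g] / total:.3f}  control {as_ctrl[g] / total:.3f}  net {net[g]:+.3f}")
```

In this configuration the early and late groups carry negative net weight and the middle group carries positive net weight, so a differential trend in the middle group would bias the estimate in the same direction as the trend, while one in the early or late group would bias it in the opposite direction.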

Finally, I develop simple tools to describe the TWFEDD estimator and evaluate why estimates change across specifications.⁸ Plotting the 2x2 DDs against their weights displays heterogeneity in the components of the weighted average and shows which terms and timing groups matter most. Summing the weights on the timing comparisons quantifies “how much” of the variation comes from timing (a common question in practice), and provides practical guidance on how well the TWFEDD estimator works compared to alternative estimators (Sun and Abraham, 2020; Borusyak and Jaravel, 2017; Callaway and Sant’Anna, 2020; Imai and Kim, 2021; Strezhnev, 2018; Ben-Michael et al., 2019). Comparing TWFEDD estimates across specifications in a Oaxaca-Blinder-Kitagawa decomposition measures how much of the change in the overall estimate comes from the 2x2 DDs (consistent with confounding or within-group heterogeneity), the weights (changing estimand), or the interaction of the two. Scattering the 2x2 DDs or the weights from different specifications shows which specific terms drive these differences. I also provide the first detailed analysis of specifications with time-varying controls, which can address bias but also change the sources of identification to include comparisons between units with the same treatment but different covariates.
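The Oaxaca-Blinder-Kitagawa comparison of two specifications amounts to a three-way split of the change in a weighted average. The sketch below uses the standard form of that split with the first specification as the baseline; the 2x2 DDs and weights are made-up numbers, and the variable names are hypothetical.

```python
# Hypothetical 2x2 DDs and weights for the same three comparisons under two specifications.
dd_a = [-4.0, -1.0, 2.0]
s_a = [0.5, 0.3, 0.2]
dd_b = [-5.0, -1.5, 2.5]
s_b = [0.6, 0.3, 0.1]

beta_a = sum(s * b for s, b in zip(s_a, dd_a))
beta_b = sum(s * b for s, b in zip(s_b, dd_b))

# Change attributable to the 2x2 DDs, holding weights at the baseline specification.
due_to_dds = sum(s1 * (b2 - b1) for s1, b1, b2 in zip(s_a, dd_a, dd_b))
# Change attributable to the weights, holding the 2x2 DDs at the baseline specification.
due_to_weights = sum((s2 - s1) * b1 for s1, s2, b1 in zip(s_a, s_b, dd_a))
# Change attributable to DDs and weights moving together.
interaction = sum((s2 - s1) * (b2 - b1) for s1, s2, b1, b2 in zip(s_a, s_b, dd_a, dd_b))

print(beta_b - beta_a, due_to_dds, due_to_weights, interaction)
```

The three components add up to the difference between the two estimates by construction, so a large weights component signals a changing estimand rather than a change in the underlying comparisons.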

To demonstrate these methods I replicate Stevenson and Wolfers (2006), who study the effect of unilateral divorce laws on female suicide rates. The TWFEDD estimates suggest that unilateral divorce leads to 3 fewer suicides per million women. More than a third of the identifying variation comes from treatment timing and the rest comes from comparisons to states whose reform status does not change during the sample period. Event-study estimates show that the treatment effects grow over time, though, which biases many of the timing comparisons. The TWFEDD estimate (−3.08) is therefore a misleading summary of the average post-treatment effect (about −5). Much of the sensitivity across specifications comes from changes in weights, or a small number of 2x2 DDs, and need not indicate bias.

My results show how and why the TWFEDD estimator can fail to identify interpretable treatment effect parameters and suggest that practitioners should be careful when relying on it in designs with treatment timing variation. Fortunately, recent research has developed simple flexible estimators that address the problems I describe (e.g. Callaway and Sant’Anna, 2020), enabling applied researchers to make better use of variation in treatment timing.

Section snippets

The difference-in-differences decomposition theorem

When units experience treatment at different times, one cannot estimate equation (1) because the post-period dummy is not defined for control observations. Nearly all work that exploits variation in treatment timing uses the two-way fixed effects regression in Eq. (2) (Cameron and Trivedi, 2005, p. 738). Researchers clearly recognize that differences in when units received treatment contribute to identification, but have not been able to describe how these comparisons are made.⁹

Theory: What parameter does DD identify and under what assumptions?

Theorem 1 relates the regression DD coefficient to sample averages, which makes it simple to analyze its statistical properties by writing ${\hat{β}}^{D D}$ in terms of potential outcomes (Holland, 1986, Rubin, 1974). Define $Y_{i t} (k)$ as the outcome of unit $i$ in period $t$ when it is treated at $t_{i} = k$ , and use $Y_{i t} (t_{i})$ to denote treated potential outcomes under unit $i$ ’s actual treatment date. $Y_{i t} (0)$ is the untreated potential outcome. If $t < t_{i}$ then $Y_{i t} (t_{i}) = Y_{i t} (0)$ . The observed outcome is $y_{i t} = D_{i t} Y_{i t} (t_{i}) + (1 - D_{i t}) Y_{i t} (0)$ . Following

DD decomposition in practice: Unilateral divorce and female suicide

To illustrate how to use the DD decomposition theorem in practice, I replicate Stevenson and Wolfers’ (2006) analysis of no-fault divorce reforms and female suicide. Unilateral (or no-fault) divorce allowed either spouse to end a marriage, redistributing property rights and bargaining power relative to fault-based divorce regimes. Stevenson and Wolfers exploit “the natural variation resulting from the different timing of the adoption of unilateral divorce laws” in 37 states from 1969–1985 (see

Alternative specifications

The results above refer to parsimonious regressions like (2), but researchers almost always estimate multiple specifications and use differences to evaluate internal validity (Oster, 2016) or choose projects in the first place. This section extends the DD decomposition theorem to different weighting choices and control variables, providing simple new tools for learning why estimates change across specifications.

The DD decomposition theorem suggests a simple way to understand why estimates

Conclusion

Difference-in-differences is perhaps the most widely applicable quasi-experimental research design, but it has primarily been understood in the context of the simplest two-group/two-period estimator. I show that when treatment timing varies across units, the TWFEDD estimator equals a weighted average of all possible simple 2x2 DDs that compare one group that changes treatment status to another group that does not. Many ways in which the theoretical interpretation of regression DD differs from

Acknowledgments

I thank Michael Anderson, Andrew Baker, Martha Bailey, Marianne Bitler, Brantly Callaway, Kitt Carpenter, Eric Chyn, Bill Collins, Scott Cunningham, John DiNardo, Andrew Dustan, Federico Gutierrez, Brian Kovak, Emily Lawler, Doug Miller, Austin Nichols, Sayeh Nikpay, Edward Norton, Jesse Rothstein, Pedro Sant’Anna, Jesse Shapiro, Gary Solon, Isaac Sorkin, Sarah West, and seminar participants at the Southern Economics Association, ASHEcon 2018, the University of California, Davis, University of

References (69)

Angrist, Joshua D. (1991). Grouped-data estimation and testing in simple labor-supply models. J. Econometrics.
Angrist, Joshua D. et al. Chapter 23 - Empirical strategies in labor economics.
Callaway, Brantly et al. (2018). Quantile treatment effects in difference in differences models under dependence restrictions and with only two time periods. J. Econometrics.
Heckman, James J. et al. Chapter 31 - The economics and econometrics of active labor market programs.
Hotz, V. Joseph et al. (2005). Predicting the efficacy of future training programs using past experiences at other locations. J. Econometrics.
Abadie, Alberto (2005). Semiparametric difference-in-differences estimators. Rev. Econom. Stud.
Abadie, Alberto et al. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. J. Amer. Statist. Assoc.
Allcott, Hunt (2015). Site selection bias in program evaluation. Q. J. Econ.
Almond, Douglas et al. (2011). Inside the war on poverty: The impact of food stamps on birth outcomes. Rev. Econ. Stat.
Angrist, Joshua D. et al. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Angrist, Joshua D. et al. (2015). Mastering ’Metrics: The Path from Cause to Effect.
Athey, Susan et al. (2006). Identification and inference in nonlinear difference-in-differences models. Econometrica.
Athey, Susan et al. (2018). Design-Based Analysis in Difference-in-Differences Settings with Staggered Adoption. Working Paper.
Ben-Michael, Eli et al. (2019). Synthetic Controls and Weighted Event Studies with Staggered Adoption. Working Paper.
Bertrand, Marianne et al. (2004). How much should we trust differences-in-differences estimates? Q. J. Econ.
Bilinski, Alyssa et al. (2019). Nothing to See Here? Non-Inferiority Approaches to Parallel Trends and Other Model Assumptions. Working Paper.
Bitler, Marianne P. et al. (2003). Some evidence on race, welfare reform, and household income. Amer. Econ. Rev.
Blinder, Alan S. (1973). Wage discrimination: Reduced form and structural estimates. J. Hum. Resour.
Borusyak, Kirill et al. (2017). Revisiting Event Study Designs. Harvard University Working Paper.
Callaway, Brantly et al. (2020). Difference-in-differences with multiple time periods. J. Econometrics.
Cameron, Colin et al. (2005). Microeconometrics: Methods and Applications.
Cengiz, Doruk et al. (2019). The effect of minimum wages on low-wage jobs. Q. J. Econ.
de Chaisemartin, Clément et al. (2018). Fuzzy differences-in-differences. Rev. Econom. Stud.
de Chaisemartin, Clément et al. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. Amer. Econ. Rev.
Chernozhukov, Victor et al. (2013). Average and quantile effects in nonseparable panel models. Econometrica.
Chyn, Eric (2018). Moved to opportunity: The long-run effect of public housing demolition on labor market outcomes of children. Amer. Econ. Rev.
Cunningham, Scott (2021). Causal Inference: The Mixtape.
Deaton, Angus (1997). The Analysis of Household Surveys: A Microeconometric Approach to Development Policy.
Deshpande, Manasi et al. (2019). Who is screened out? Application costs and the targeting of disability programs. Amer. Econ. J.: Econ. Policy.
Fadlon, Itzik et al. (2015). Family Labor Supply Responses to Severe Health Shocks. National Bureau of Economic Research Working Paper Series (21352).
Frisch, Ragnar et al. (1933). Partial time regressions as compared with individual trends. Econometrica.
Gibbons, Charles E. et al. (2018). Broken or fixed effects? J. Econometr. Methods.
Goodman, Joshua (2017). The Labor of Division: Returns to Compulsory High School Math Coursework. National Bureau of Economic Research Working Paper Series (23063).
Goodman-Bacon, Andrew et al. (2019). Bacondecomp: Stata module for decomposing difference-in-differences estimation with variation in treatment timing. Stata Command.


^☆: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


Published by Elsevier B.V.
