Difference-in-differences with variation in treatment timing☆
Introduction
Difference-in-differences (DD) is both the most common and the oldest quasi-experimental research design, dating back to Snow’s (1855) analysis of a London cholera outbreak. A DD estimate is the difference between the change in outcomes before and after a treatment (difference one) in a treatment versus control group (difference two): $\hat{\beta}^{2x2}_{DD} = \left(\bar{y}^{POST}_{TREAT} - \bar{y}^{PRE}_{TREAT}\right) - \left(\bar{y}^{POST}_{CONTROL} - \bar{y}^{PRE}_{CONTROL}\right)$. That simple quantity also equals the estimated coefficient on the interaction of a treatment group dummy and a post-treatment period dummy in the following regression:
$$y_{it} = \gamma + \gamma_{i\cdot}\, TREAT_i + \gamma_{\cdot t}\, POST_t + \beta^{2x2}_{DD}\, TREAT_i \times POST_t + u_{it}. \tag{1}$$
The elegance of DD makes it clear which comparisons generate the estimate, what leads to bias, and how to test the design. The expression in terms of sample means connects the regression to potential outcomes and shows that, under a common trends assumption, a two-group/two-period (2x2) DD identifies the average treatment effect on the treated. Almost all econometrics textbooks and survey articles describe this structure, and recent methodological extensions build on it.
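The equivalence between the difference in sample means and the interaction coefficient can be checked numerically. The following is a minimal sketch in Python, assuming a hypothetical eight-observation data set (`rows`) and a small hand-written least-squares solver (`ols`); neither comes from the paper, and the numbers are purely illustrative.

```python
from statistics import mean

# Hypothetical toy data: (treat, post, outcome) for a two-group/two-period design.
rows = [
    (0, 0, 10.0), (0, 0, 12.0),
    (0, 1, 13.0), (0, 1, 15.0),
    (1, 0, 20.0), (1, 0, 22.0),
    (1, 1, 28.0), (1, 1, 30.0),
]

def cell_mean(treat, post):
    """Mean outcome in one group-by-period cell."""
    return mean(y for g, p, y in rows if g == treat and p == post)

# Difference one (post minus pre) within each group, then difference two across groups.
dd_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Regressors: constant, treatment-group dummy, post-period dummy, and their interaction.
X = [[1.0, g, p, g * p] for g, p, _ in rows]
y = [out for _, _, out in rows]
dd_regression = ols(X, y)[3]

print(dd_means, round(dd_regression, 10))
```

Because the regression is saturated in the four group-by-period cells, it fits each cell mean exactly, which is why the interaction coefficient reproduces the difference of differences in means.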
Most DD applications diverge from this 2x2 setup, though, because treatments usually occur at different times. Local governments change policy. Jurisdictions hand down legal rulings. Natural disasters strike across seasons. Firms lay off workers. In this case researchers estimate a regression with dummies for cross-sectional units ($\alpha_{i\cdot}$) and time periods ($\alpha_{\cdot t}$), and a treatment dummy ($D_{it}$):
$$y_{it} = \alpha_{i\cdot} + \alpha_{\cdot t} + \beta^{DD} D_{it} + e_{it}. \tag{2}$$
In contrast to our substantial understanding of canonical 2x2 DD, we know relatively little about the two-way fixed effects DD when treatment timing varies. We do not know precisely how it compares mean outcomes across groups. We typically rely on general descriptions of the identifying assumption like “interventions must be as good as random, conditional on time and group fixed effects” (Bertrand et al., 2004, p. 250). We have limited understanding of the treatment effect parameter that regression DD identifies. Finally, we often cannot evaluate how and why alternative specifications change estimates.
This paper shows that the two-way fixed effects DD estimator in (2) (TWFEDD) is a weighted average of all possible 2x2 DD estimators that compare timing groups to each other (the DD decomposition). Some use units treated at a particular time as the treatment group and untreated units as the control group. Some compare units treated at two different times, using the later-treated group as a control before its treatment begins and then the earlier-treated group as a control after its treatment begins. The weights on the 2x2 DDs are proportional to timing group sizes and the variance of the treatment dummy in each pair, which is highest for units treated in the middle of the panel.
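The decomposition can be illustrated on simulated data. The sketch below is a hedged Python illustration, not the paper's implementation: it assumes a balanced panel with hypothetical timing groups (`group_sizes`), arbitrary random outcomes, and weights written in an algebraically simplified form (products of group shares and the shares of periods spent treated) that are then normalised to sum to one. The variable names and the simplification are assumptions made for this example.

```python
import random
from itertools import combinations
from statistics import mean

random.seed(0)
T = 10                                   # periods 0..9
# Hypothetical timing groups: first treated period -> number of units (None = never treated).
group_sizes = {3: 3, 7: 2, None: 4}
start = [g for g, n in group_sizes.items() for _ in range(n)]
N = len(start)
D = [[int(s is not None and t >= s) for t in range(T)] for s in start]
y = [[random.gauss(0, 1) for _ in range(T)] for _ in range(N)]   # arbitrary outcomes

def double_demean(a):
    """Subtract unit and period means and add back the grand mean (balanced panel)."""
    row = [mean(r) for r in a]
    col = [mean(a[i][t] for i in range(N)) for t in range(T)]
    grand = mean(row)
    return [[a[i][t] - row[i] - col[t] + grand for t in range(T)] for i in range(N)]

# TWFE coefficient on D via Frisch-Waugh: regress y on the double-demeaned treatment dummy.
D_til = double_demean(D)
num = sum(y[i][t] * D_til[i][t] for i in range(N) for t in range(T))
den = sum(D_til[i][t] ** 2 for i in range(N) for t in range(T))
beta_twfe = num / den

def avg(group, periods):
    """Mean outcome of a timing group over a window of periods."""
    return mean(y[i][t] for i in range(N) if start[i] == group for t in periods)

def dd(treat, control, pre, post):
    """2x2 DD: change for the treatment group minus change for the control group."""
    return (avg(treat, post) - avg(treat, pre)) - (avg(control, post) - avg(control, pre))

share = {g: n / N for g, n in group_sizes.items()}               # group shares
dbar = {g: (T - g) / T for g in group_sizes if g is not None}    # share of periods treated
timing = sorted(g for g in group_sizes if g is not None)
terms = []   # (label, unnormalised weight, 2x2 DD)
for k in timing:
    # Timing group versus never-treated units.
    terms.append((f"{k} vs untreated",
                  share[k] * share[None] * dbar[k] * (1 - dbar[k]),
                  dd(k, None, range(0, k), range(k, T))))
for k, l in combinations(timing, 2):     # k is treated earlier than l
    mid = range(k, l)
    # Earlier group as treatment, later group as control before its treatment begins.
    terms.append((f"{k} vs {l}",
                  share[k] * share[l] * (dbar[k] - dbar[l]) * (1 - dbar[k]),
                  dd(k, l, range(0, k), mid)))
    # Later group as treatment, already-treated earlier group as control.
    terms.append((f"{l} vs {k}",
                  share[k] * share[l] * dbar[l] * (dbar[k] - dbar[l]),
                  dd(l, k, mid, range(l, T))))

total = sum(w for _, w, _ in terms)
weights = [w / total for _, w, _ in terms]
beta_decomp = sum(s * b for s, (_, _, b) in zip(weights, terms))

for (label, _, b), s in zip(terms, weights):
    print(f"{label:16s} weight={s:.3f} dd={b:+.3f}")
print(round(beta_twfe, 10), round(beta_decomp, 10))
```

The two printed numbers agree for any outcome matrix, since the identity is mechanical. Pairs whose treated share of periods is close to one half receive more weight, in line with the statement that treatment variance is highest for units treated in the middle of the panel.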
I first use this DD decomposition to show that TWFEDD estimates a variance-weighted average of treatment effect parameters, sometimes with “negative weights” (Borusyak and Jaravel, 2017; de Chaisemartin and D’Haultfœuille, 2020; Sun and Abraham, 2020). When treatment effects do not change over time, TWFEDD yields a variance-weighted average of cross-group treatment effects and all weights are positive. Negative weights only arise when average treatment effects vary over time. The DD decomposition shows why: when already-treated units act as controls, changes in their outcomes are subtracted and these changes may include time-varying treatment effects. This does not imply a failure of the design in the sense of non-parallel trends in counterfactual outcomes, but it does suggest caution when using TWFE estimators to summarize treatment effects.
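A stylized numerical example shows how already-treated controls can distort the estimate. The sketch below is a hypothetical Python illustration, assuming two timing groups of one unit each, untreated outcomes fixed at zero (so trends are exactly parallel), and a treatment effect that grows by one unit per period after adoption. The adoption dates and effect path are invented for illustration only.

```python
from statistics import mean

T = 10                           # periods 1..10
start = {"early": 3, "late": 7}  # hypothetical first treated period of each group

def effect(group, t):
    """Treatment effect that grows by one unit per period after adoption."""
    return max(0, t - start[group] + 1)

# Untreated potential outcomes are zero for everyone, so trends are exactly parallel
# and the observed outcome equals the treatment effect.
periods = range(1, T + 1)
groups = list(start)
y = {g: {t: float(effect(g, t)) for t in periods} for g in groups}

def avg(group, window):
    return mean(y[group][t] for t in window)

mid = range(start["early"], start["late"])   # only the early group is treated
post = range(start["late"], T + 1)           # both groups are treated

# 2x2 DD with the late group as treatment and the already-treated early group as control.
change_late = avg("late", post) - avg("late", mid)
change_early = avg("early", post) - avg("early", mid)   # growth in the control's own effect
dd_late_vs_early = change_late - change_early
att_late = mean(effect("late", t) for t in post)

# TWFE coefficient by double demeaning the treatment dummy (balanced panel).
D = {g: {t: int(t >= start[g]) for t in periods} for g in groups}
d_unit = {g: mean(list(D[g].values())) for g in groups}
d_time = {t: mean(D[g][t] for g in groups) for t in periods}
d_all = mean(list(d_unit.values()))
D_til = {g: {t: D[g][t] - d_unit[g] - d_time[t] + d_all for t in periods} for g in groups}
beta_twfe = (sum(y[g][t] * D_til[g][t] for g in groups for t in periods)
             / sum(D_til[g][t] ** 2 for g in groups for t in periods))

print(att_late, dd_late_vs_early, round(beta_twfe, 4))
```

In this setup every treatment effect is positive and the late group's average post-treatment effect is 2.5, yet the later-versus-earlier 2x2 DD equals −1.5 because the early group's effect grows by 4 between the two windows and that growth is subtracted. The TWFE coefficient is −1/6, a negative summary of uniformly positive effects.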
Next I use the DD decomposition to define “common trends” when one is interested in using TWFEDD to identify the variance-weighted treatment effect parameter. Each 2x2 DD relies on pairwise common trends in untreated potential outcomes, so the overall assumption is an average of these terms using the variance-based decomposition weights. The extent to which a given timing group’s differential trend biases the overall estimate equals the difference between the total weight on 2x2 DDs where it is the treatment group and the total weight on 2x2 DDs where it is the control group. Because units treated near the beginning or the end of the panel have the lowest treatment variance, they can get more weight as controls than as treatments. In designs without untreated units they always do.
Finally, I develop simple tools to describe the TWFEDD estimator and evaluate why estimates change across specifications. Plotting the 2x2 DDs against their weights displays heterogeneity in the components of the weighted average and shows which terms and timing groups matter most. Summing the weights on the timing comparisons quantifies “how much” of the variation comes from timing (a common question in practice), and provides practical guidance on how well the TWFEDD estimator works compared to alternative estimators (Sun and Abraham, 2020; Borusyak and Jaravel, 2017; Callaway and Sant’Anna, 2020; Imai and Kim, 2021; Strezhnev, 2018; Ben-Michael et al., 2019). Comparing TWFEDD estimates across specifications in an Oaxaca-Blinder-Kitagawa decomposition measures how much of the change in the overall estimate comes from the 2x2 DDs (consistent with confounding or within-group heterogeneity), the weights (changing estimand), or the interaction of the two. Scattering the 2x2 DDs or the weights from different specifications shows which specific terms drive these differences. I also provide the first detailed analysis of specifications with time-varying controls, which can address bias but also change the sources of identification to include comparisons between units with the same treatment but different covariates.
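The three-part comparison across specifications can be sketched with a standard Oaxaca-Blinder-Kitagawa identity. The Python example below is illustrative only: the 2x2 DDs and weights (`dd_base`, `s_base`, `dd_alt`, `s_alt`) are hypothetical numbers, and treating the baseline specification as the reference point is an assumption of this sketch.

```python
# Hypothetical 2x2 DDs and weights under a baseline and an alternative specification.
dd_base = [-4.0, -1.0, 2.5]
s_base = [0.5, 0.3, 0.2]
dd_alt = [-3.0, -1.5, 2.0]
s_alt = [0.6, 0.3, 0.1]

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def diff(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

beta_base = dot(s_base, dd_base)
beta_alt = dot(s_alt, dd_alt)

due_to_dds = dot(s_base, diff(dd_alt, dd_base))        # 2x2 DDs change, weights held fixed
due_to_weights = dot(diff(s_alt, s_base), dd_base)     # weights change, 2x2 DDs held fixed
interaction = dot(diff(s_alt, s_base), diff(dd_alt, dd_base))

print(round(beta_alt - beta_base, 6),
      round(due_to_dds, 6), round(due_to_weights, 6), round(interaction, 6))
```

The three components add up to the change in the overall estimate by construction, so their relative sizes indicate whether a specification change mostly moves the underlying comparisons or mostly re-weights them.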
To demonstrate these methods I replicate Stevenson and Wolfers (2006), who study the effect of unilateral divorce laws on female suicide rates. The TWFEDD estimates suggest that unilateral divorce leads to 3 fewer suicides per million women. More than a third of the identifying variation comes from treatment timing and the rest comes from comparisons to states whose reform status does not change during the sample period. Event-study estimates show that the treatment effects grow over time, though, which biases many of the timing comparisons. The TWFEDD estimate (−3.08) is therefore a misleading summary of the average post-treatment effect (about −5). Much of the sensitivity across specifications comes from changes in weights, or a small number of 2x2 DDs, and need not indicate bias.
My results show how and why the TWFEDD estimator can fail to identify interpretable treatment effect parameters and suggest that practitioners should be careful when relying on it in designs with treatment timing variation. Fortunately, recent research has developed simple flexible estimators that address the problems I describe (e.g. Callaway and Sant’Anna, 2020), enabling applied researchers to make better use of variation in treatment timing.
The difference-in-differences decomposition theorem
When units experience treatment at different times, one cannot estimate Eq. (1) because the post-period dummy is not defined for control observations. Nearly all work that exploits variation in treatment timing uses the two-way fixed effects regression in Eq. (2) (Cameron and Trivedi, 2005, p. 738). Researchers clearly recognize that differences in when units received treatment contribute to identification, but have not been able to describe how these comparisons are made.
Theory: What parameter does DD identify and under what assumptions?
Theorem 1 relates the regression DD coefficient to sample averages, which makes it simple to analyze its statistical properties by writing it in terms of potential outcomes (Holland, 1986; Rubin, 1974). Define $Y^{k}_{it}$ as the outcome of unit $i$ in period $t$ when it is treated at $t^{*}_{k}$, and use $Y^{t^{*}_{i}}_{it}$ to denote treated potential outcomes under unit $i$’s actual treatment date. $Y^{0}_{it}$ is the untreated potential outcome. If $t < t^{*}_{k}$ then $Y^{k}_{it} = Y^{0}_{it}$. The observed outcome is $y_{it} = D_{it} Y^{t^{*}_{i}}_{it} + (1 - D_{it}) Y^{0}_{it}$. Following
DD decomposition in practice: Unilateral divorce and female suicide
To illustrate how to use the DD decomposition theorem in practice, I replicate Stevenson and Wolfers’ (2006) analysis of no-fault divorce reforms and female suicide. Unilateral (or no-fault) divorce allowed either spouse to end a marriage, redistributing property rights and bargaining power relative to fault-based divorce regimes. Stevenson and Wolfers exploit “the natural variation resulting from the different timing of the adoption of unilateral divorce laws” in 37 states from 1969–1985 (see
Alternative specifications
The results above refer to parsimonious regressions like (2), but researchers almost always estimate multiple specifications and use differences to evaluate internal validity (Oster, 2016) or choose projects in the first place. This section extends the DD decomposition theorem to different weighting choices and control variables, providing simple new tools for learning why estimates change across specifications.
The DD decomposition theorem suggests a simple way to understand why estimates
Conclusion
Difference-in-differences is perhaps the most widely applicable quasi-experimental research design, but it has primarily been understood in the context of the simplest two-group/two-period estimator. I show that when treatment timing varies across units, the TWFEDD estimator equals a weighted average of all possible simple 2x2 DDs that compare one group that changes treatment status to another group that does not. Many ways in which the theoretical interpretation of regression DD differs from
Acknowledgments
I thank Michael Anderson, Andrew Baker, Martha Bailey, Marianne Bitler, Brantly Callaway, Kitt Carpenter, Eric Chyn, Bill Collins, Scott Cunningham, John DiNardo, Andrew Dustan, Federico Gutierrez, Brian Kovak, Emily Lawler, Doug Miller, Austin Nichols, Sayeh Nikpay, Edward Norton, Jesse Rothstein, Pedro Sant’Anna, Jesse Shapiro, Gary Solon, Isaac Sorkin, Sarah West, and seminar participants at the Southern Economics Association, ASHEcon 2018, the University of California, Davis, University of
References (69)
Grouped-data estimation and testing in simple labor-supply models. J. Econometrics (1991)
Chapter 23 - Empirical strategies in labor economics
Quantile treatment effects in difference in differences models under dependence restrictions and with only two time periods. J. Econometrics (2018)
Chapter 31 - The economics and econometrics of active labor market programs
Predicting the efficacy of future training programs using past experiences at other locations. J. Econometrics (2005)
Semiparametric difference-in-differences estimators. Rev. Econom. Stud. (2005)
Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. J. Amer. Statist. Assoc. (2010)
Site selection bias in program evaluation. Q. J. Econ. (2015)
Inside the war on poverty: The impact of food stamps on birth outcomes. Rev. Econ. Stat. (2011)
Mostly Harmless Econometrics: An Empiricist’s Companion (2009)
Mastering ’Metrics: The Path from Cause to Effect
Identification and inference in nonlinear difference-in-differences models. Econometrica
Design-Based Analysis in Difference-in-Differences Settings with Staggered Adoption. Working Paper
Synthetic Controls and Weighted Event Studies with Staggered Adoption. Working Paper
How much should we trust differences-in-differences estimates? Q. J. Econ.
Nothing to See Here? Non-Inferiority Approaches to Parallel Trends and Other Model Assumptions. Working Paper
Some evidence on race, welfare reform, and household income. Amer. Econ. Rev.
Wage discrimination: Reduced form and structural estimates. J. Hum. Resour.
Revisiting Event Study Designs. Harvard University Working Paper
Difference-in-differences with multiple time periods. J. Econometrics
Microeconometrics: Methods and Applications
The effect of minimum wages on low-wage jobs. Q. J. Econ.
Fuzzy differences-in-differences. Rev. Econom. Stud.
Two-way fixed effects estimators with heterogeneous treatment effects. Amer. Econ. Rev.
Average and quantile effects in nonseparable panel models. Econometrica
Moved to opportunity: The long-run effect of public housing demolition on labor market outcomes of children. Amer. Econ. Rev.
Causal Inference: The Mixtape
The Analysis of Household Surveys: A Microeconometric Approach to Development Policy
Who is screened out? Application costs and the targeting of disability programs. Amer. Econ. J.: Econ. Policy
Family Labor Supply Responses to Severe Health Shocks. National Bureau of Economic Research Working Paper Series (21352)
Partial time regressions as compared with individual trends. Econometrica
Broken or fixed effects? J. Econometr. Methods
The Labor of Division: Returns to Compulsory High School Math Coursework. National Bureau of Economic Research Working Paper Series (23063)
Bacondecomp: Stata module for decomposing difference-in-differences estimation with variation in treatment timing. Stata Command
☆ This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.