
Journal of Econometrics

Volume 225, Issue 2, December 2021, Pages 254-277

Difference-in-differences with variation in treatment timing

https://doi.org/10.1016/j.jeconom.2021.03.014

Abstract

The canonical difference-in-differences (DD) estimator contains two time periods, “pre” and “post”, and two groups, “treatment” and “control”. Most DD applications, however, exploit variation across groups of units that receive treatment at different times. This paper shows that the two-way fixed effects estimator equals a weighted average of all possible two-group/two-period DD estimators in the data. A causal interpretation of two-way fixed effects DD estimates requires both a parallel trends assumption and treatment effects that are constant over time. I show how to decompose the difference between two specifications, and provide a new analysis of models that include time-varying controls.

Introduction

Difference-in-differences (DD) is both the most common and the oldest quasi-experimental research design, dating back to Snow’s (1855) analysis of a London cholera outbreak. A DD estimate is the difference between the change in outcomes before and after a treatment (difference one) in a treatment versus control group (difference two): $(\bar{y}_{TREAT}^{POST} - \bar{y}_{TREAT}^{PRE}) - (\bar{y}_{CONTROL}^{POST} - \bar{y}_{CONTROL}^{PRE})$. That simple quantity also equals the estimated coefficient on the interaction of a treatment-group dummy and a post-treatment-period dummy in the following regression: $y_{it} = \gamma + \gamma_i TREAT_i + \gamma_t POST_t + \beta^{2\times 2}\, TREAT_i \times POST_t + u_{it}$ (1). The elegance of DD makes it clear which comparisons generate the estimate, what leads to bias, and how to test the design. The expression in terms of sample means connects the regression to potential outcomes and shows that, under a common trends assumption, a two-group/two-period (2x2) DD identifies the average treatment effect on the treated. Almost all econometrics textbooks and survey articles describe this structure, and recent methodological extensions build on it.
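As a purely illustrative check of this equivalence (not from the paper; the simulated data and variable names below are hypothetical), the 2x2 DD computed from the four cell means coincides with the OLS coefficient on the interaction term:

```python
# Illustrative check (simulated data, hypothetical names): the 2x2 DD in sample
# means equals the OLS coefficient on TREAT x POST in a saturated regression.
import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, n)   # treatment-group dummy
post = rng.integers(0, 2, n)    # post-period dummy
y = 1.0 + 0.5 * treat + 0.3 * post + 2.0 * treat * post + rng.normal(0, 1, n)

# difference-in-differences from the four cell means
dd_means = ((y[(treat == 1) & (post == 1)].mean() - y[(treat == 1) & (post == 0)].mean())
            - (y[(treat == 0) & (post == 1)].mean() - y[(treat == 0) & (post == 0)].mean()))

# OLS with intercept, TREAT, POST, and the interaction
X = np.column_stack([np.ones(n), treat, post, treat * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(dd_means, beta[3])   # identical up to floating-point error
```

Because the regression is saturated in the two dummies, the equality is exact rather than approximate.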

Most DD applications diverge from this 2x2 setup, though, because treatments usually occur at different times. Local governments change policy. Jurisdictions hand down legal rulings. Natural disasters strike across seasons. Firms lay off workers. In this case researchers estimate a regression with dummies for cross-sectional units ($\alpha_i$) and time periods ($\alpha_t$), and a treatment dummy ($D_{it}$): $y_{it} = \alpha_i + \alpha_t + \beta^{DD} D_{it} + e_{it}$ (2). In contrast to our substantial understanding of the canonical 2x2 DD, we know relatively little about the two-way fixed effects DD estimator when treatment timing varies. We do not know precisely how it compares mean outcomes across groups. We typically rely on general descriptions of the identifying assumption, like “interventions must be as good as random, conditional on time and group fixed effects” (Bertrand et al., 2004, p. 250). We have limited understanding of the treatment effect parameter that regression DD identifies. Finally, we often cannot evaluate how and why alternative specifications change estimates.
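A minimal sketch of estimating Eq. (2) on a simulated staggered-adoption panel (all names and parameter values below are hypothetical, not the paper’s data):

```python
# Minimal sketch (simulated data, hypothetical names): two-way fixed effects DD,
# i.e. Eq. (2), estimated by OLS with unit dummies, time dummies, and D_it.
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 60, 10
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

# staggered adoption: treatment starts in period 3, period 6, or never
treat_date = np.repeat(rng.choice([3.0, 6.0, np.inf], size=n_units), n_periods)
D = (time >= treat_date).astype(float)

y = (np.repeat(rng.normal(0, 1, n_units), n_periods)   # unit effects
     + 0.2 * time                                      # common trend
     + 2.0 * D                                         # constant treatment effect
     + rng.normal(0, 1, n_units * n_periods))

X = np.column_stack([
    (unit[:, None] == np.arange(n_units)).astype(float),       # unit dummies
    (time[:, None] == np.arange(1, n_periods)).astype(float),  # time dummies (one dropped)
    D,
])
print("TWFE DD estimate:", np.linalg.lstsq(X, y, rcond=None)[0][-1])  # close to 2.0
```

With a constant treatment effect, as here, the TWFE coefficient recovers it; the complications discussed below arise when effects vary over time.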

This paper shows that the two-way fixed effects DD estimator in (2) (TWFEDD) is a weighted average of all possible 2x2 DD estimators that compare timing groups to each other (the DD decomposition). Some use units treated at a particular time as the treatment group and untreated units as the control group. Some compare units treated at two different times, using the later-treated group as a control before its treatment begins and then the earlier-treated group as a control after its treatment begins. The weights on the 2x2 DDs are proportional to timing group sizes and the variance of the treatment dummy in each pair, which is highest for units treated in the middle of the panel.
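To make the decomposition concrete, the sketch below enumerates the 2x2 comparisons for a hypothetical three-group panel (an early group, a late group, and a never-treated group) and computes each 2x2 DD from group-by-period means; the Theorem 1 weights, which are proportional to group sizes and the treatment-dummy variance in each pair, are described in the paper and not reproduced here.

```python
# Hypothetical sketch of the 2x2 building blocks behind the TWFE DD estimator:
# for each pair of timing groups, compute the simple two-group/two-period DD
# over the relevant window. Weights from Theorem 1 are not computed here.
import numpy as np

rng = np.random.default_rng(2)
periods = np.arange(10)
groups = {"early": 3, "late": 6, "never": None}   # treatment dates

# simulated group-by-period mean outcomes with a constant effect of 2.0
ybar = {}
for g, date in groups.items():
    trend = 0.2 * periods + rng.normal(0, 0.1, len(periods))
    effect = np.zeros(len(periods)) if date is None else (periods >= date) * 2.0
    ybar[g] = trend + effect

def dd(yt, yc, pre, post):
    """2x2 DD: change for the treatment group minus change for the control group."""
    return (yt[post].mean() - yt[pre].mean()) - (yc[post].mean() - yc[pre].mean())

# (a) timing groups vs. the never-treated group
for g in ["early", "late"]:
    k = groups[g]
    print(g, "vs never:", dd(ybar[g], ybar["never"], periods < k, periods >= k))

# (b) early vs. late, before the late group is treated (late group is the control)
k, l = groups["early"], groups["late"]
print("early vs late (pre-l window):",
      dd(ybar["early"], ybar["late"], periods < k, (periods >= k) & (periods < l)))

# (c) late vs. early, after the early group is treated (early group is the control)
print("late vs early (post-k window):",
      dd(ybar["late"], ybar["early"], (periods >= k) & (periods < l), periods >= l))
```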

I first use this DD decomposition to show that TWFEDD estimates a variance-weighted average of treatment effect parameters, sometimes with “negative weights” (Borusyak and Jaravel, 2017, de Chaisemartin and D’Haultfœuille, 2020, Sun and Abraham, 2020). When treatment effects do not change over time, TWFEDD yields a variance-weighted average of cross-group treatment effects, and all weights are positive. Negative weights arise only when average treatment effects vary over time. The DD decomposition shows why: when already-treated units act as controls, changes in their outcomes are subtracted, and these changes may include time-varying treatment effects. This does not imply a failure of the design in the sense of non-parallel trends in counterfactual outcomes, but it does suggest caution when using TWFE estimators to summarize treatment effects.
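A hedged simulation sketch of this point (hypothetical data, not the paper’s application): when the treatment effect grows with time since treatment, the TWFE DD estimate falls short of the average post-treatment effect, because already-treated units serve as controls for later-treated ones.

```python
# Illustrative simulation: treatment effects that grow with time since treatment
# pull the TWFE DD estimate below the average post-treatment effect.
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 90, 12
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
treat_date = np.repeat(rng.choice([3.0, 7.0, np.inf], size=n_units), n_periods)

D = (time >= treat_date).astype(float)
exposure = np.where(D == 1, time - treat_date + 1, 0.0)   # periods since treatment
y = 0.2 * time + 1.0 * exposure + rng.normal(0, 0.5, n_units * n_periods)

X = np.column_stack([
    (unit[:, None] == np.arange(n_units)).astype(float),
    (time[:, None] == np.arange(1, n_periods)).astype(float),
    D,
])
beta_dd = np.linalg.lstsq(X, y, rcond=None)[0][-1]
avg_post_effect = exposure[D == 1].mean()   # true average effect on the treated

print("TWFE DD estimate:", beta_dd)
print("average post-treatment effect:", avg_post_effect)
# The TWFE estimate comes out smaller than the average effect on the treated.
```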

Next I use the DD decomposition to define “common trends” when one is interested in using TWFEDD to identify the variance-weighted treatment effect parameter. Each 2x2 DD relies on pairwise common trends in untreated potential outcomes, so the overall assumption is an average of these terms using the variance-based decomposition weights. The extent to which a given timing group’s differential trend biases the overall estimate equals the difference between the total weight on 2x2 DDs where it is the treatment group and the total weight on 2x2 DDs where it is the control group. Because units treated near the beginning or the end of the panel have the lowest treatment variance, they can get more weight as controls than as treatments. In designs without untreated units they always do.
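In illustrative notation (the symbols below are mine, not the paper’s), if timing group $k$’s untreated outcomes trend differentially by $\Delta_k$, the resulting bias in the probability limit of $\hat{\beta}^{DD}$ is weighted by the gap between group $k$’s total weight as a treatment group and its total weight as a control group:

$$\text{bias from differential trends} \;=\; \sum_k \Delta_k \,\bigl(w_k^{\text{treat}} - w_k^{\text{control}}\bigr),$$

where $w_k^{\text{treat}}$ and $w_k^{\text{control}}$ sum the decomposition weights on the 2x2 DDs in which group $k$ serves as the treatment or the control group, respectively.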

Finally, I develop simple tools to describe the TWFEDD estimator and evaluate why estimates change across specifications. Plotting the 2x2 DDs against their weights displays heterogeneity in the components of the weighted average and shows which terms and timing groups matter most. Summing the weights on the timing comparisons quantifies “how much” of the variation comes from timing (a common question in practice) and provides practical guidance on how well the TWFEDD estimator works compared to alternative estimators (Sun and Abraham, 2020, Borusyak and Jaravel, 2017, Callaway and Sant’Anna, 2020, Imai and Kim, 2021, Strezhnev, 2018, Ben-Michael et al., 2019). Comparing TWFEDD estimates across specifications in an Oaxaca-Blinder-Kitagawa decomposition measures how much of the change in the overall estimate comes from the 2x2 DDs (consistent with confounding or within-group heterogeneity), from the weights (a changing estimand), or from the interaction of the two. Scattering the 2x2 DDs or the weights from different specifications shows which specific terms drive these differences. I also provide the first detailed analysis of specifications with time-varying controls, which can address bias but also change the sources of identification to include comparisons between units with the same treatment but different covariates.
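The specification comparison can be written out as follows (with illustrative notation: $d_j$ and $w_j$ are the 2x2 DDs and weights from a baseline specification, $\tilde{d}_j$ and $\tilde{w}_j$ those from an alternative):

$$\hat{\beta}^{DD}_{\text{alt}} - \hat{\beta}^{DD}_{\text{base}}
 = \underbrace{\sum_j w_j\,(\tilde{d}_j - d_j)}_{\text{changing 2x2 DDs}}
 + \underbrace{\sum_j d_j\,(\tilde{w}_j - w_j)}_{\text{changing weights}}
 + \underbrace{\sum_j (\tilde{d}_j - d_j)(\tilde{w}_j - w_j)}_{\text{interaction}}.$$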

To demonstrate these methods I replicate Stevenson and Wolfers (2006), who study the effect of unilateral divorce laws on female suicide rates. The TWFEDD estimates suggest that unilateral divorce leads to about 3 fewer suicides per million women. More than a third of the identifying variation comes from treatment timing and the rest comes from comparisons to states whose reform status does not change during the sample period. Event-study estimates show that the treatment effects grow over time, though, which biases many of the timing comparisons. The TWFEDD estimate (−3.08) is therefore a misleading summary of the average post-treatment effect (about −5). Much of the sensitivity across specifications comes from changes in the weights or in a small number of 2x2 DDs, and need not indicate bias.

My results show how and why the TWFEDD estimator can fail to identify interpretable treatment effect parameters and suggest that practitioners should be careful when relying on it in designs with treatment timing variation. Fortunately, recent research has developed simple flexible estimators that address the problems I describe (e.g. Callaway and Sant’Anna, 2020), enabling applied researchers to make better use of variation in treatment timing.

Section snippets

The difference-in-differences decomposition theorem

When units experience treatment at different times, one cannot estimate equation (1) because the post-period dummy is not defined for control observations. Nearly all work that exploits variation in treatment timing uses the two-way fixed effects regression in Eq. (2) (Cameron and Trivedi, 2005, p. 738). Researchers clearly recognize that differences in when units received treatment contribute to identification, but have not been able to describe how these comparisons are made.

Theory: What parameter does DD identify and under what assumptions?

Theorem 1 relates the regression DD coefficient to sample averages, which makes it simple to analyze its statistical properties by writing $\hat{\beta}^{DD}$ in terms of potential outcomes (Holland, 1986, Rubin, 1974). Define $Y_{it}(k)$ as the outcome of unit $i$ in period $t$ when it is treated at $t_i = k$, and use $Y_{it}^{t_i}$ to denote treated potential outcomes under unit $i$’s actual treatment date. $Y_{it}^{0}$ is the untreated potential outcome. If $t < t_i$ then $Y_{it}^{t_i} = Y_{it}^{0}$. The observed outcome is $y_{it} = D_{it}\,Y_{it}^{t_i} + (1 - D_{it})\,Y_{it}^{0}$. Following

DD decomposition in practice: Unilateral divorce and female suicide

To illustrate how to use the DD decomposition theorem in practice, I replicate Stevenson and Wolfers’ (2006) analysis of no-fault divorce reforms and female suicide. Unilateral (or no-fault) divorce allowed either spouse to end a marriage, redistributing property rights and bargaining power relative to fault-based divorce regimes. Stevenson and Wolfers exploit “the natural variation resulting from the different timing of the adoption of unilateral divorce laws” in 37 states from 1969–1985 (see

Alternative specifications

The results above refer to parsimonious regressions like (2), but researchers almost always estimate multiple specifications and use differences to evaluate internal validity (Oster, 2016) or choose projects in the first place. This section extends the DD decomposition theorem to different weighting choices and control variables, providing simple new tools for learning why estimates change across specifications.

The DD decomposition theorem suggests a simple way to understand why estimates

Conclusion

Difference-in-differences is perhaps the most widely applicable quasi-experimental research design, but it has primarily been understood in the context of the simplest two-group/two-period estimator. I show that when treatment timing varies across units, the TWFEDD estimator equals a weighted average of all possible simple 2x2 DDs that compare one group that changes treatment status to another group that does not. Many ways in which the theoretical interpretation of regression DD differs from

Acknowledgments

I thank Michael Anderson, Andrew Baker, Martha Bailey, Marianne Bitler, Brantly Callaway, Kitt Carpenter, Eric Chyn, Bill Collins, Scott Cunningham, John DiNardo, Andrew Dustan, Federico Gutierrez, Brian Kovak, Emily Lawler, Doug Miller, Austin Nichols, Sayeh Nikpay, Edward Norton, Jesse Rothstein, Pedro Sant’Anna, Jesse Shapiro, Gary Solon, Isaac Sorkin, Sarah West, and seminar participants at the Southern Economics Association, ASHEcon 2018, the University of California, Davis, University of

References (69)

  • Angrist, Joshua D., et al. Mastering ’Metrics: The Path from Cause to Effect (2015)
  • Athey, Susan, et al. Identification and inference in nonlinear difference-in-differences models. Econometrica (2006)
  • Athey, Susan, et al. Design-Based Analysis in Difference-in-Differences Settings with Staggered Adoption. Working Paper (2018)
  • Ben-Michael, Eli, et al. Synthetic Controls and Weighted Event Studies with Staggered Adoption. Working Paper (2019)
  • Bertrand, Marianne, et al. How much should we trust differences-in-differences estimates? Q. J. Econ. (2004)
  • Bilinski, Alyssa, et al. Nothing to See Here? Non-Inferiority Approaches to Parallel Trends and Other Model Assumptions. Working Paper (2019)
  • Bitler, Marianne P., et al. Some evidence on race, welfare reform, and household income. Amer. Econ. Rev. (2003)
  • Blinder, Alan S. Wage discrimination: Reduced form and structural estimates. J. Hum. Resour. (1973)
  • Borusyak, Kirill, et al. Revisiting Event Study Designs. Harvard University Working Paper (2017)
  • Callaway, Brantly, et al. Difference-in-differences with multiple time periods. J. Econometrics (2020)
  • Cameron, Colin, et al. Microeconometrics: Methods and Applications (2005)
  • Cengiz, Doruk, et al. The effect of minimum wages on low-wage jobs. Q. J. Econ. (2019)
  • de Chaisemartin, Clément, et al. Fuzzy differences-in-differences. Rev. Econom. Stud. (2018)
  • de Chaisemartin, Clément, et al. Two-way fixed effects estimators with heterogeneous treatment effects. Amer. Econ. Rev. (2020)
  • Chernozhukov, Victor, et al. Average and quantile effects in nonseparable panel models. Econometrica (2013)
  • Chyn, Eric. Moved to opportunity: The long-run effect of public housing demolition on labor market outcomes of children. Amer. Econ. Rev. (2018)
  • Cunningham, Scott. Causal Inference: The Mixtape (2021)
  • Deaton, Angus. The Analysis of Household Surveys: A Microeconometric Approach to Development Policy (1997)
  • Deshpande, Manasi, et al. Who is screened out? Application costs and the targeting of disability programs. Amer. Econ. J.: Econ. Policy (2019)
  • Fadlon, Itzik, et al. Family Labor Supply Responses to Severe Health Shocks. National Bureau of Economic Research Working Paper Series (21352) (2015)
  • Frisch, Ragnar, et al. Partial time regressions as compared with individual trends. Econometrica (1933)
  • Gibbons, Charles E., et al. Broken or fixed effects? J. Econometr. Methods (2018)
  • Goodman, Joshua. The Labor of Division: Returns to Compulsory High School Math Coursework. National Bureau of Economic Research Working Paper Series (23063) (2017)
  • Goodman-Bacon, Andrew, et al. Bacondecomp: Stata module for decomposing difference-in-differences estimation with variation in treatment timing. Stata Command (2019)
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.